Edelweiss Real Estate Housing Project: Data Analysis & Modelling (Part 1)

1. Data Preprocessing

In [1]:
import pandas as pd
import numpy as np

df=pd.read_csv('edw_launch_price_data.csv')
df.head()
Out[1]:
pid b1 b2 b3 b4 b5 b6 b7 b8 zone ... sp_interior sp_switches sp_windows sp_wiring sp_exterior sp_frame sp_points sp_lobby sp_cement wap
0 1 106.0 5.0 NaN NaN NaN NaN NaN NaN Zone A ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 14464.0000
1 2 NaN 98.0 52.0 NaN NaN NaN NaN NaN Zone A ... NaN NaN UPVC Windows with Granite Sills NaN NaN NaN NaN NaN NaN 13982.0000
2 3 NaN 98.0 52.0 NaN NaN NaN NaN NaN Zone A ... NaN NaN UPVC Windows with Granite Sills NaN NaN NaN NaN NaN NaN 13982.0000
3 6 107.0 13.0 NaN NaN NaN NaN NaN NaN Zone A ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 12857.0000
4 7 1.0 22.0 10.0 NaN NaN NaN NaN NaN Zone A ... NaN NaN French Windows NaN NaN NaN NaN NaN NaN 16981.5758

5 rows × 112 columns

1.1 Creation of micro_market number and developer_number columns

In [2]:
df['micro_market_number']=df['micro_market'].str.extract('(\d+)')
df['developer_number']=df['developer'].str.extract('(\d+)')
df['project_number']=df['project'].str.extract('(\d+)')

1.2 Extract launch year from launch date

In [3]:
df['launch_year'] = pd.DatetimeIndex(df['launch_date']).year

1.3 Log-transformation of wap into log_wap

In [4]:
df['log_wap'] = np.log(df['wap'])

1.4 Recoding of proj_bedrooms

In [5]:
df['proj_bedrooms'].value_counts()
Out[5]:
1,2              1075
1                 519
1,2,3             296
2,3               241
2                 161
3                  56
2,3,4              51
3,4                39
1,2,3,4            33
3,2                17
4                  16
3,1,2              14
1,3                12
3,4,5              10
1,2,4               9
1,2,3,5             8
1,3,4               6
2,3,5               6
2,4                 6
2,3,4,5             5
3,5                 4
3,4,2               3
4,2,3               3
4,3                 3
4,5                 3
1,2,3,4,5           3
1,3,4,5             2
2,3,8               1
4,5,6,3             1
4,2                 1
2,5                 1
6,7,1,2,3,4,5       1
3,4,6               1
5,2,3,4             1
1,2,4,5             1
2,3,4,5,1           1
2,3,4,6             1
4,5,8               1
4,1,3               1
2,3,4,1             1
3,6                 1
4,1,2,3             1
6,2,3,4,5           1
3,1                 1
6,1,2,3,4           1
1,5,8               1
2,3,5,1             1
Name: proj_bedrooms, dtype: int64
In [6]:
new_tags=['1,2,3','1,2,3,4','1,2,3,4','1,3','2,3','2,4','1,3,4','1,2,3,4,5','2,3,4,5','3,4','1,2,3,5','2,3,4','2,3,4']
old_tags=['3,1,2','4,1,2,3','2,3,4,1','3,1','3,2','4,2','4,1,3','2,3,4,5,1','5,2,3,4','4,3','2,3,5,1','4,2,3','3,4,2']

df["proj_bedrooms_clean"]=df["proj_bedrooms"]
df["proj_bedrooms_clean"]=df["proj_bedrooms_clean"].replace(old_tags,new_tags)
df["proj_bedrooms_clean"]=df["proj_bedrooms_clean"].str.split(',').str[0]
df["proj_bedrooms_clean"]=df["proj_bedrooms_clean"].replace(['1','2','3','4','5','6'],['1 and more','2 and more','3 and more','4 and more','5 and more','6 and more'])
df["proj_bedrooms_clean"].value_counts()
Out[6]:
1 and more    1985
2 and more     498
3 and more     114
4 and more      21
6 and more       3
Name: proj_bedrooms_clean, dtype: int64
In [7]:
cols=['zone', 'micro_market_number', 'developer_number','project_number','wap','log_wap','proj_bedrooms_clean','unit_type','proj_launched_units','launch_year','construction_status','max_size','min_size','b1','b2','b3','b4','b5','b6','b7','b8']
df1=pd.DataFrame(df[cols]).fillna("0")
df1.head()
Out[7]:
zone micro_market_number developer_number project_number wap log_wap proj_bedrooms_clean unit_type proj_launched_units launch_year ... max_size min_size b1 b2 b3 b4 b5 b6 b7 b8
0 Zone A 1 1 1 14464.0000 9.579418 1 and more Apartment 111 2015 ... 744.75 312.48 106 5 0 0 0 0 0 0
1 Zone A 1 2 2 13982.0000 9.545526 2 and more Apartment 150 2014 ... 1003.30 730.87 0 98 52 0 0 0 0 0
2 Zone A 1 2 3 13982.0000 9.545526 2 and more Apartment 150 2014 ... 1003.30 730.87 0 98 52 0 0 0 0 0
3 Zone A 1 5 6 12857.0000 9.461644 1 and more Apartment 120 2016 ... 590.51 269.96 107 13 0 0 0 0 0 0
4 Zone A 1 7 8 16981.5758 9.739884 1 and more Apartment 33 2015 ... 1717.00 681.57 1 22 10 0 0 0 0 0

5 rows × 21 columns

1.5 Type Conversion

In [8]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2621 entries, 0 to 2620
Data columns (total 21 columns):
zone                   2621 non-null object
micro_market_number    2621 non-null object
developer_number       2621 non-null object
project_number         2621 non-null object
wap                    2621 non-null float64
log_wap                2621 non-null float64
proj_bedrooms_clean    2621 non-null object
unit_type              2621 non-null object
proj_launched_units    2621 non-null int64
launch_year            2621 non-null int64
construction_status    2621 non-null object
max_size               2621 non-null float64
min_size               2621 non-null float64
b1                     2621 non-null object
b2                     2621 non-null object
b3                     2621 non-null object
b4                     2621 non-null object
b5                     2621 non-null object
b6                     2621 non-null object
b7                     2621 non-null object
b8                     2621 non-null object
dtypes: float64(4), int64(2), object(15)
memory usage: 430.1+ KB
In [9]:
def type_converter (dataframe, col_name, data_type):
    dataframe[col_name]=dataframe[col_name].astype(data_type)
In [10]:
for i in list(df1.columns)[-8:]:
    type_converter (df1, i, 'int')
In [11]:
for i in ['zone','micro_market_number','proj_bedrooms_clean','project_number','unit_type','launch_year','construction_status']:
    type_converter (df1, i, 'category')
In [12]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2621 entries, 0 to 2620
Data columns (total 21 columns):
zone                   2621 non-null category
micro_market_number    2621 non-null category
developer_number       2621 non-null object
project_number         2621 non-null category
wap                    2621 non-null float64
log_wap                2621 non-null float64
proj_bedrooms_clean    2621 non-null category
unit_type              2621 non-null category
proj_launched_units    2621 non-null int64
launch_year            2621 non-null category
construction_status    2621 non-null category
max_size               2621 non-null float64
min_size               2621 non-null float64
b1                     2621 non-null int32
b2                     2621 non-null int32
b3                     2621 non-null int32
b4                     2621 non-null int32
b5                     2621 non-null int32
b6                     2621 non-null int32
b7                     2621 non-null int32
b8                     2621 non-null int32
dtypes: category(7), float64(4), int32(8), int64(1), object(1)
memory usage: 330.9+ KB

Creation of bedrooms by typology columns

In [13]:
df1['bp1']=df1['b1']/df1['proj_launched_units']
df1['bp2']=df1['b2']/df1['proj_launched_units']
df1['bp3']=df1['b3']/df1['proj_launched_units']
df1['bp4']=df1['b4']/df1['proj_launched_units']
df1['bp5']=df1['b5']/df1['proj_launched_units']
df1['bp6']=df1['b6']/df1['proj_launched_units']
df1['bp7']=df1['b7']/df1['proj_launched_units']
df1['bp8']=df1['b8']/df1['proj_launched_units']
In [14]:
df1['b48']=df1['b4']+df1['b5']+df1['b6']+df1['b7']+df1['b8']

2. Exploratory Data Analysis

2.1 Clustering Analysis of Micromarkets

  1. K-Means Clustering based on mean of wap for every micromarket
  2. Assigning cluster numbers to respective micro_markets
  3. Counting number of projects and micromarkets in every cluster
In [15]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set(style="darkgrid")
sns.set_color_codes("dark")
In [16]:
mm_prices=df1.groupby('micro_market_number')['wap'].mean()
mm_prices
Out[16]:
micro_market_number
1      16070.897978
10     12709.734924
100     6730.000000
101     4800.454082
102     7601.348358
104     4550.000000
105     9666.666667
106     5561.967821
107     8028.775714
108    10951.648586
109     4281.954455
11     13484.938329
110     5600.885779
111    15000.000000
112     3487.000000
113     4788.333333
114    10016.946792
115    24283.307556
116    49078.461538
118    36898.200000
119    34324.333333
12     12687.445504
120    33074.574463
121    19744.017888
122    38020.914456
123    28022.157103
124    31466.083000
13     14725.692669
14      3217.645000
15      3230.666667
           ...     
69      3134.751107
7      14843.527256
70      2900.637781
72      4615.802286
73      4877.616544
75     17564.462999
76     17072.694050
77     18741.930000
79     17901.507612
8      16130.349347
80     20000.000000
81     50000.000000
82     24814.456483
84     31521.000000
85     31160.750000
86     24508.750000
87     26954.200000
88     29835.340446
89     30000.000000
9      15391.270927
90     27500.000000
91     21137.908680
92     27392.152940
93     42797.333333
94     41928.750000
95     10719.430572
96      5898.400000
97      4174.288250
98      9155.409091
99      5780.332080
Name: wap, Length: 113, dtype: float64

The number of micromarkets is too large to be included as predictor variable. Let's group/cluster micromarkets according to mean prices.

In [17]:
mm_prices.shape
Out[17]:
(113,)
In [18]:
mm=mm_prices.values
mm=mm.reshape(-1, 1)
mm.shape
Out[18]:
(113, 1)
In [19]:
from sklearn.cluster import KMeans

#clustering formula
kmeans = KMeans(n_clusters=20, n_init=100, max_iter=1000,n_jobs=5)

# get the clusters
kmm = kmeans.fit_predict(mm)

#clustering metric
from sklearn.metrics import silhouette_score
print("Silhoutte Score of Micromarket Clusters:",silhouette_score(mm, kmm))
Silhoutte Score of Micromarket Clusters: 0.6702724484489644
In [20]:
#cluster dataframe
kmm=pd.DataFrame(kmm)
kmm.columns=['mm_cluster']

#cluster number and number of projects for every micromarket
mm_proj=df1.groupby('micro_market_number')['project_number'].count().reset_index()
mm_proj=pd.concat([mm_proj, kmm],axis=1)
mm_proj
Out[20]:
micro_market_number project_number mm_cluster
0 1 49 16
1 10 12 14
2 100 5 12
3 101 49 2
4 102 33 12
5 104 2 2
6 105 3 10
7 106 55 19
8 107 7 12
9 108 14 0
10 109 49 2
11 11 28 14
12 110 129 19
13 111 1 6
14 112 1 9
15 113 3 2
16 114 181 0
17 115 9 11
18 116 13 7
19 118 5 8
20 119 15 15
21 12 27 14
22 120 8 15
23 121 26 17
24 122 18 8
25 123 33 5
26 124 10 1
27 13 52 6
28 14 2 9
29 15 3 9
... ... ... ...
83 69 46 9
84 7 27 6
85 70 7 9
86 72 50 2
87 73 32 2
88 75 95 3
89 76 6 16
90 77 4 3
91 79 8 3
92 8 15 16
93 80 2 17
94 81 1 7
95 82 12 11
96 84 12 1
97 85 8 1
98 86 11 11
99 87 5 5
100 88 5 18
101 89 2 18
102 9 21 6
103 90 1 5
104 91 5 13
105 92 5 5
106 93 3 4
107 94 4 4
108 95 4 0
109 96 5 19
110 97 40 2
111 98 11 10
112 99 5 19

113 rows × 3 columns

In [21]:
# Dictionary mapping micromarket number with cluster numbers
market_cluster = dict(zip(mm_proj['micro_market_number'], mm_proj['mm_cluster']))
In [22]:
# Create a new column in the dataset indicating cluster numbers, initialised with micromarket numbers
df1['mm_cluster']=df1['micro_market_number']
# Map cluster numbers to their respective micromarket numbers 
df1['mm_cluster']=df1['mm_cluster'].map(market_cluster)
df1['mm_cluster'].value_counts()
Out[22]:
9     559
2     394
19    347
0     230
6     218
16    187
14    129
12    126
3     124
17     73
5      46
8      42
11     32
1      30
15     23
10     19
13     14
7      14
4       7
18      7
Name: mm_cluster, dtype: int64
In [23]:
# How many projects are there for each cluster?
kmm_proj=df1.groupby('mm_cluster')['project_number'].count().reset_index()
kmm_proj=pd.DataFrame(kmm_proj)
kmm_proj
Out[23]:
mm_cluster project_number
0 0 230
1 1 30
2 2 394
3 3 124
4 4 7
5 5 46
6 6 218
7 7 14
8 8 42
9 9 559
10 10 19
11 11 32
12 12 126
13 13 14
14 14 129
15 15 23
16 16 187
17 17 73
18 18 7
19 19 347
In [24]:
# projects per cluster
sns.barplot(x="mm_cluster", y="project_number", data=kmm_proj)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a59f775208>

Let's check stability of clusters by looking at how mean wap prices vary across clusters for every launch year.

We will compute Spearman's Rank Correlation to check for monotonic relations for the mean prices across clusters for all years.

In [25]:
year_mm_price=df1.groupby(['launch_year','mm_cluster'])['wap'].mean()
year_mm_price=pd.DataFrame(year_mm_price)

mean_wap_by_year=pd.concat([year_mm_price.loc[(2014)],year_mm_price.loc[(2015)],year_mm_price.loc[(2016)],year_mm_price.loc[(2017)],year_mm_price.loc[(2018)]],axis=1)
mean_wap_by_year.columns=['meanwap_2014','meanwap_2015','meanwap_2016','meanwap_2017','meanwap_2018']
mean_wap_by_year
Out[25]:
meanwap_2014 meanwap_2015 meanwap_2016 meanwap_2017 meanwap_2018
mm_cluster
0 9485.603483 11429.214727 9672.252469 9846.522545 11802.490393
1 29000.000000 31921.800000 32120.569167 30106.800000 NaN
2 4293.311314 4338.947165 4508.587906 5001.930194 4332.647120
3 18112.103745 17384.851852 17553.483153 17890.197668 17814.555000
4 36130.666667 37999.500000 55858.000000 NaN NaN
5 28082.162960 26854.000000 28538.773155 27601.714286 NaN
6 14098.454837 14831.488343 14715.145918 15174.651398 15406.955109
7 41070.750000 52483.166667 64374.666667 15714.000000 NaN
8 35808.407242 42769.880344 38872.774000 35593.090909 NaN
9 2984.480692 3023.583100 3159.066072 3250.648610 3179.228021
10 9100.000000 9265.000000 9033.200000 9276.500000 NaN
11 25540.571429 24137.198489 26291.311667 22824.759933 26730.000000
12 7241.750000 7190.230312 6923.446786 7284.184616 8434.699402
13 22224.488950 16400.000000 22492.666667 26016.000000 24048.565500
14 12218.709031 13146.632946 12832.710219 13602.809310 13159.716423
15 34678.250000 34907.799386 33380.333333 32685.333333 NaN
16 15442.331257 16442.129733 17007.092950 16575.809007 16303.168677
17 18537.288920 19194.296218 20681.789474 24589.081480 21229.112470
18 NaN 35000.000000 33000.000000 27220.483133 29757.626415
19 5245.697588 5253.664460 5337.699442 5756.918935 8158.254843
In [26]:
from scipy.stats import spearmanr

rho, pval = spearmanr(mean_wap_by_year,nan_policy='omit') # rho - correlation coefficient, p-value - signifies monotonic relationship
rho = pd.DataFrame(rho)
rho.index=['meanwap_2014','meanwap_2015','meanwap_2016','meanwap_2017','meanwap_2018']
rho.columns=['meanwap_2014','meanwap_2015','meanwap_2016','meanwap_2017','meanwap_2018']
rho
Out[26]:
meanwap_2014 meanwap_2015 meanwap_2016 meanwap_2017 meanwap_2018
meanwap_2014 1.000000 0.987719 1.000000 0.900929 1.000000
meanwap_2015 0.987719 1.000000 0.987970 0.873684 0.967033
meanwap_2016 1.000000 0.987970 1.000000 0.892982 1.000000
meanwap_2017 0.900929 0.873684 0.892982 1.000000 0.983516
meanwap_2018 1.000000 0.967033 1.000000 0.983516 1.000000

Given the high value of Spearman's Rank Correlation coefficients, there's a strong monotonic relation between mean wap prices across clusters throughout various years. The clusters are indeed stable.

In [27]:
#Number of micromarkets in each cluster
kmm_micro=mm_proj.groupby('mm_cluster')['micro_market_number'].count().reset_index()
kmm_micro=pd.DataFrame(kmm_micro)
kmm_micro
Out[27]:
mm_cluster micro_market_number
0 0 4
1 1 3
2 2 18
3 3 4
4 4 2
5 5 5
6 6 9
7 7 2
8 8 3
9 9 23
10 10 4
11 11 3
12 12 6
13 13 2
14 14 6
15 15 2
16 16 6
17 17 3
18 18 2
19 19 6
In [28]:
# Visual representation of number of micromarkets in each cluster
sns.barplot(x="mm_cluster", y="micro_market_number", data=kmm_micro)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a59c430940>

Note that the initial centroids of kmeans algorithms are not same every time the algorithm is run. Therefore, the micromarket frequency counts for each cluster will be different every time this notebook is run.

In [29]:
# wap against different clusters
f, ax = plt.subplots(figsize=(20,10))
sns.boxplot(x="mm_cluster", y="wap", data=df1)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a59fd89780>

Micromarket clustering explains the variation in weighted average prices well. Micromarket clusters is thus a good predictor for wap.

In [30]:
sns.catplot(x="mm_cluster", y="wap", col="zone", col_wrap=2, data=df1, kind="strip", height=5, aspect=1.1, palette='Set2')
Out[30]:
<seaborn.axisgrid.FacetGrid at 0x1a59fd82048>

Stripplots also show frequency counts of a class of a categorical data apart from showing how a continous target variable's tendency of centrality and variance on it.

In [31]:
sns.catplot(x="mm_cluster", y="wap", col="proj_bedrooms_clean",col_wrap=2, data=df1, kind="strip", height=5, aspect=1.4, palette='Set2')
Out[31]:
<seaborn.axisgrid.FacetGrid at 0x1a5a0c7d940>

Projects with greater number of bedrooms are pricier across all micromarkets. However, such projects have low frequency.

In [32]:
sns.catplot(x="mm_cluster", y="wap", col="launch_year", data=df1, col_wrap=2, kind="strip", height=5, aspect=1.2, palette='Set2')
Out[32]:
<seaborn.axisgrid.FacetGrid at 0x1a5a2909a20>

Prices are time invariant for different clusters of micromarkets. Prices tend to decrease over time across all micromarket clusters.

In [33]:
sns.catplot(x="mm_cluster", y="wap", col="construction_status",col_wrap=2, data=df1, kind="strip", height=6, aspect=1, palette='Set2')
Out[33]:
<seaborn.axisgrid.FacetGrid at 0x1a5a3429128>

Clustering of micromarkets indeed explains price variations significantly. Significant variations of prices are visible across zones and different segments of other categorical variables as well. Since clustering/grouping of micromarkets was done based on mean prices, this shows that prices are also strongly influenced by other factors.

2.2 Exploratory Data Analysis for Wap Against Different Possible Predictor Variables

2.2.1 Contingency Tables of Categorical Variables

Let us begin by looking at frequency distributions of micromarket clusters and several categorical features. We begin with crosstab tables involving unit-type, project bedrooms and construction status.

In [34]:
sns.heatmap(pd.crosstab(df1['construction_status'],df1['unit_type']),annot=True, fmt='d',cmap='gist_gray')
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a3306c88>
In [35]:
sns.heatmap(pd.crosstab(df1['proj_bedrooms_clean'], df1['unit_type']),annot=True, fmt='d',cmap='gist_gray')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a3fceeb8>

Number of apartments overshadow that of villas. Most of the projects are already understruction and consist of 1 and 2 bedrooms.

In [36]:
sns.heatmap(pd.crosstab(df1['proj_bedrooms_clean'],df1['construction_status']),annot=True, fmt='d',cmap='gist_gray',robust=True)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a405da58>

Now, let us look at frequency distributions of micromarket clusters across unit-types, number of project bedrooms and construction status of projects.

In [37]:
fig,ax=plt.subplots(figsize=(10,10))
fig=sns.heatmap(pd.crosstab(df1['mm_cluster'],df1['construction_status']),annot=True, fmt='d',cmap='Blues')
In [38]:
fig,ax=plt.subplots(figsize=(10,10))
fig=sns.heatmap(pd.crosstab(df1['mm_cluster'],df1['unit_type']),annot=True, fmt='d',cmap='copper')
In [39]:
fig,ax=plt.subplots(figsize=(10,10))
fig=sns.heatmap(pd.crosstab(df1['mm_cluster'],df1['proj_bedrooms_clean']),annot=True, fmt='d',cmap='copper')

Micromarket frequencies are not evenly distributed across all categories of unit-type, project bedrooms and construction.

In [40]:
fig,ax=plt.subplots(figsize=(10,10))
fig=sns.heatmap(pd.crosstab(df1['mm_cluster'],df1['zone']),annot=True, fmt='d',cmap='copper')

2.2.2 How Prices of Different Segments of Categorical Variables Vary Across Zones

In [41]:
f, ax = plt.subplots(figsize=(10,5))
sns.boxplot(x="zone", y="wap",data=df1)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a4178dd8>

As shown just now, wap varies significant among different zones. Zone J is the costliest area due to highest mean price. Zone B is cheapest.

In [42]:
sns.catplot(x="zone", y="wap", col="proj_bedrooms_clean",col_wrap=2,data=df1, kind="bar", height=5, aspect=1.3, palette='Set3',ci='sd')
Out[42]:
<seaborn.axisgrid.FacetGrid at 0x1a5a3431c88>

The barplots show mean prices while the error bar depicts variation in prices.

In [43]:
sns.catplot(x="zone", y="wap", col="unit_type",col_wrap=2, data=df1, kind="bar", height=6, aspect=1.1, palette='Set2')
Out[43]:
<seaborn.axisgrid.FacetGrid at 0x1a5a4b8bb00>

Villas have far fewer occurences compared to apartments as most people are not wealthy enough to afford a villa. Villas do not occur in all zones.

In [44]:
sns.catplot(x="zone", y="wap", col="construction_status",col_wrap=2,data=df1, kind="bar", height=6, aspect=1.1)
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x1a5a47970b8>

Prices varies with project bedroom numbers, unit-types and construction status differently across different years.

2.2.3 How Prices of Different Segments of Categorical Variables Vary Over Years

In [45]:
f, ax = plt.subplots(figsize=(10,5))
sns.boxplot(x="launch_year", y="wap", data=df1,palette='Set1')
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a4a21320>

Prices are definitely not time invariant. Let's observe how they vary with time across different segments of categorical variables.

A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars. We are using pointplots to analyse how prices vary across years for different categories of some predictor variables.

In [46]:
sns.catplot(x="launch_year", y="wap", col="proj_bedrooms_clean",col_wrap=2, data=df1, kind="point", height=5, aspect=1, color='darkgreen')
Out[46]:
<seaborn.axisgrid.FacetGrid at 0x1a5a488e438>

Generally, the greater the number of bedrooms, the pricier an apartment is. Prices declined over time in spite of temporary spike.

In [47]:
sns.catplot(x="launch_year", y="wap", col="construction_status",col_wrap=2, data=df1, kind="point", height=5, aspect=1, color='darkorange')
Out[47]:
<seaborn.axisgrid.FacetGrid at 0x1a5a4a977f0>

Projects that commenced earlier have definitely completed earlier. Very few projects are under pre-launch at the time of analysis.

In [48]:
sns.pointplot(x="launch_year", y="wap", hue="unit_type",data=df1, markers=["o",'*'], linestyles=["-",'--'])
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a541d780>

Unit type definitely influences prices.

General trend is that prices decrease over time for all categories of a categorical feature. An exception is the pre-launch project prices. This is probably because speculated prices before a project commences are subjected to changes by various market factors.

2.2.4 EDA Involving Prices Against Categorical Variables Across Different Segments of Other Categorical Variables

In [49]:
sns.boxplot(x="unit_type", y="wap", data=df1, palette='Set1')
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a5401438>
In [50]:
sns.boxplot(x="proj_bedrooms_clean", y="wap", data=df1, palette='Set1')
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a71c5550>
In [51]:
sns.boxplot(x="construction_status", y="wap", data=df1, palette='Set1')
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a7258be0>

Compared to project bedrooms and unit types, construction status seem to contribute to least variance of prices within interquartile range across its different categories.

In [52]:
sns.catplot(x="proj_bedrooms_clean", y="wap", col="construction_status",col_wrap=2, data=df1, kind="box", height=5, aspect=1, palette='Set3')
Out[52]:
<seaborn.axisgrid.FacetGrid at 0x1a5a72503c8>

Flats under construction - those with 4 bedrooms are costliest. Completed flats- those with 3 bedrooms are costliest.

In [53]:
sns.catplot(x="proj_bedrooms_clean", y="wap", col="unit_type",col_wrap=2, data=df1, kind="box", height=5, aspect=1.1, palette='Set3')
Out[53]:
<seaborn.axisgrid.FacetGrid at 0x1a5a76c9b38>

Apartments- 4 bedrooms and more are costliest. Those with one bedrooms least expensive. Villa- Largely sparse. 2 bedrooms and more units are costlier. Those with one bedrooms least expensive.

In [54]:
sns.catplot(x="unit_type", y="wap", col="construction_status",col_wrap=2, data=df1, kind="box", height=5, aspect=1, palette='Set3')
Out[54]:
<seaborn.axisgrid.FacetGrid at 0x1a5a53e7eb8>

Apartments show greater central tendency and variation in prices due to higher frequency counts and demand compared to villa whose individual unit is pricier as only economically elites can afford villas.

2.2.5 EDA Involving Prices Against Continous Variables Across Different Segments of Categorical Variables

Now, let's look at how maxsize and minsize varies with wap across different segments of various categorical variables.

In [55]:
sns.lmplot(x="min_size", y="wap",col='zone',hue='zone',col_wrap=2, height=5, aspect=1, data=df1)
Out[55]:
<seaborn.axisgrid.FacetGrid at 0x1a5a78fabe0>
In [56]:
sns.lmplot(x="max_size", y="wap",col='zone',hue='zone',col_wrap=2, height=5, aspect=1, data=df1)
Out[56]:
<seaborn.axisgrid.FacetGrid at 0x1a5a891e5f8>

For different zones, variations in correlation between prices and min_size, as well as prices and max_size. Many of the predictor variables are strongly correlated with zone.

Strong multicollinearity between min_size and max_size suspected.

In [57]:
sns.lmplot(x="max_size", y="wap",col='launch_year',hue='launch_year',col_wrap=2, height=5, aspect=1, data=df1)
Out[57]:
<seaborn.axisgrid.FacetGrid at 0x1a5a96cb710>

Max_size varies across years. It's sharpest increase is recorded in 2016.

In [58]:
sns.lmplot(x="max_size", y="wap",col='proj_bedrooms_clean',hue='proj_bedrooms_clean',col_wrap=2, height=5, aspect=1,data=df1)
Out[58]:
<seaborn.axisgrid.FacetGrid at 0x1a5a96cbf98>

Prices are invariant for all maximum sizes regardless of number of bedrooms.

In [59]:
sns.lmplot(x="max_size", y="wap",col='construction_status',hue='construction_status',col_wrap=2, height=5,aspect=1, data=df1)
sns.lmplot(x="max_size", y="wap",col='unit_type',hue='unit_type',col_wrap=2, height=5, aspect=1,data=df1)
Out[59]:
<seaborn.axisgrid.FacetGrid at 0x1a5ab83b550>

Unit type and construction status across maximum sizes of projects also influence prices.

Max_size not affected by proj_bedrooms_clean as it seems.

In [60]:
# Line plots involving prices against bedroom by typology
f, ax = plt.subplots(4,2 ,figsize=(20,10))
sns.lineplot(x="bp1", y='wap', data=df1, color="black",ax=ax[0,0])
sns.lineplot(x="bp2", y='wap', data=df1, color="yellow",ax=ax[0,1])
sns.lineplot(x="bp3", y='wap', data=df1, color="green",ax=ax[1,0])
sns.lineplot(x="bp4", y='wap', data=df1, color="maroon",ax=ax[1,1])
sns.lineplot(x="bp5", y='wap', data=df1, color="orange",ax=ax[2,0])
sns.lineplot(x="bp6", y='wap', data=df1, color="darkblue",ax=ax[2,1])
sns.lineplot(x="bp7", y='wap', data=df1, color="red",ax=ax[3,0])
sns.lineplot(x="bp8", y='wap', data=df1, color="indigo",ax=ax[3,1])
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5ac253e80>

Except for bp1, increasing trend between prices and bedrooms by typology per project. This is because number of bedrooms in an unit influences price. Figures portray noisy data due to sparsity of columns.

2.3 Feature Selection Test / Analysis

2.3.1 Feature Selection For Categorical Predictor Variables

In [61]:
# Split up wap into intervals
df1['wap_interval'] = pd.cut(df1['wap'], bins=[0,5e+3,1e+4,3e+4,9e+4]) 
df1['wap_interval'].value_counts()
Out[61]:
(0.0, 5000.0]         1017
(10000.0, 30000.0]     927
(5000.0, 10000.0]      565
(30000.0, 90000.0]     112
Name: wap_interval, dtype: int64
In [62]:
from scipy.stats import chi2_contingency

print("-----P-Values of two-way Chi-Square Test involving categorical predictors-----")
for i in ['construction_status','unit_type','proj_bedrooms_clean','launch_year','mm_cluster','zone']:
    cont=pd.crosstab(df1['wap_interval'],df1[i]) #Contingency table
    g, p, dof, expctd = chi2_contingency(cont)
    print(i,": ", p)
-----P-Values of two-way Chi-Square Test involving categorical predictors-----
construction_status :  1.1780529837133837e-05
unit_type :  9.917379872277916e-06
proj_bedrooms_clean :  9.273133094395507e-141
launch_year :  0.12845156266274824
mm_cluster :  0.0
zone :  0.0

2.3.2 Feature Selection For Continuous Predictor Variables

Based on inferences from clustermap, certain pairs of variables are not advised to be used simultaneously in regression:

1) max_size and min_size

2) bp6 and bp7

3) bp5 and bp8

In [63]:
# Dendogram Linkage between continuous variables showing links based on correlation coefficients
include=['wap','max_size','min_size','bp1','bp2','bp3','bp4','bp5','bp6','bp7','bp8'] 
sns.clustermap(pd.DataFrame(df1[include]).corr(),annot=True, method='complete')# Complete linkage used for cluster analysis
Out[63]:
<seaborn.matrix.ClusterGrid at 0x1a5abbfca58>

It's desirable to have low correlations among independent variables.

Max_size and min_size should not be used simultaneously to avoid collinearity in regression model. Let's choose max_size instead first.

In [64]:
from sklearn.feature_selection import mutual_info_regression

print("-----P-Values of Mutual Information Regression Test involving continuous predictors-----")
for i in ['min_size','max_size','bp1','bp2','bp3','bp4','bp5','bp6','bp7','bp8']:
    com=df1[i].values
    com=com.reshape(-1,1)
    mi=mutual_info_regression(com, df1['wap'])
    print(i,": ", mi)
-----P-Values of Mutual Information Regression Test involving continuous predictors-----
min_size :  [0.24413104]
max_size :  [0.2852419]
bp1 :  [0.28360688]
bp2 :  [0.14796047]
bp3 :  [0.1758102]
bp4 :  [0.08078205]
bp5 :  [0.01875768]
bp6 :  [0.00261264]
bp7 :  [0.01031861]
bp8 :  [0.00560293]

The mutual information score of two random variables is a measure of the mutual dependence between the two variables. The higher the value of that score, the higher the dependency.

More information on mutual information on mutual information found here: http://www.scholarpedia.org/article/Mutual_information

The continuous variables to be chosen are max_size, bp1, bp2 and bp3.

3. Base Model of Log-Transformed Wap

3.1 Base Model Selection

Let's fit all the categorical features discussed so far and the 4 selected numerical features into a linear regression model. Try with log-transformed price as dependent variable first. Log-transformation is known to minimise skewness and ensure normality in residual distribution.

In [65]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.ols(formula='log_wap ~ max_size + C(mm_cluster) + bp1 + bp2 + bp3 + C(launch_year) + C(unit_type) + C(proj_bedrooms_clean) + C(construction_status) + C(zone)', data=df1).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                log_wap   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.922
Method:                 Least Squares   F-statistic:                     722.7
Date:                Mon, 11 Mar 2019   Prob (F-statistic):               0.00
Time:                        02:02:16   Log-Likelihood:                 311.20
No. Observations:                2621   AIC:                            -534.4
Df Residuals:                    2577   BIC:                            -276.1
Df Model:                          43                                         
Covariance Type:            nonrobust                                         
================================================================================================================
                                                   coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------------
Intercept                                        9.4992      0.083    115.004      0.000       9.337       9.661
C(mm_cluster)[T.1]                               0.9727      0.071     13.685      0.000       0.833       1.112
C(mm_cluster)[T.2]                              -0.7704      0.047    -16.529      0.000      -0.862      -0.679
C(mm_cluster)[T.3]                               0.4654      0.059      7.841      0.000       0.349       0.582
C(mm_cluster)[T.4]                               1.0544      0.108      9.749      0.000       0.842       1.266
C(mm_cluster)[T.5]                               0.8708      0.062     13.934      0.000       0.748       0.993
C(mm_cluster)[T.6]                               0.2655      0.043      6.237      0.000       0.182       0.349
C(mm_cluster)[T.7]                               1.2608      0.082     15.441      0.000       1.101       1.421
C(mm_cluster)[T.8]                               1.0701      0.056     18.950      0.000       0.959       1.181
C(mm_cluster)[T.9]                              -1.0997      0.048    -22.827      0.000      -1.194      -1.005
C(mm_cluster)[T.10]                             -0.1652      0.063     -2.620      0.009      -0.289      -0.042
C(mm_cluster)[T.11]                              0.7633      0.071     10.737      0.000       0.624       0.903
C(mm_cluster)[T.12]                             -0.3249      0.050     -6.547      0.000      -0.422      -0.228
C(mm_cluster)[T.13]                              0.5936      0.072      8.280      0.000       0.453       0.734
C(mm_cluster)[T.14]                              0.1691      0.044      3.845      0.000       0.083       0.255
C(mm_cluster)[T.15]                              1.0360      0.071     14.509      0.000       0.896       1.176
C(mm_cluster)[T.16]                              0.3659      0.043      8.522      0.000       0.282       0.450
C(mm_cluster)[T.17]                              0.5500      0.050     10.898      0.000       0.451       0.649
C(mm_cluster)[T.18]                              0.9138      0.105      8.700      0.000       0.708       1.120
C(mm_cluster)[T.19]                             -0.5918      0.047    -12.705      0.000      -0.683      -0.500
C(launch_year)[T.2015]                           0.0452      0.012      3.792      0.000       0.022       0.069
C(launch_year)[T.2016]                           0.0704      0.013      5.426      0.000       0.045       0.096
C(launch_year)[T.2017]                           0.1057      0.014      7.462      0.000       0.078       0.133
C(launch_year)[T.2018]                           0.1523      0.023      6.691      0.000       0.108       0.197
C(unit_type)[T.Villa]                           -0.1000      0.049     -2.020      0.043      -0.197      -0.003
C(proj_bedrooms_clean)[T.2 and more]            -0.0113      0.016     -0.711      0.477      -0.043       0.020
C(proj_bedrooms_clean)[T.3 and more]             0.0031      0.033      0.093      0.926      -0.061       0.068
C(proj_bedrooms_clean)[T.4 and more]             0.0766      0.074      1.031      0.303      -0.069       0.222
C(proj_bedrooms_clean)[T.6 and more]            -0.0546      0.128     -0.425      0.671      -0.306       0.197
C(construction_status)[T.Pre Launch]            -0.1108      0.090     -1.235      0.217      -0.287       0.065
C(construction_status)[T.Under Construction]    -0.0331      0.011     -2.903      0.004      -0.056      -0.011
C(zone)[T.Zone B]                               -0.1073      0.068     -1.588      0.112      -0.240       0.025
C(zone)[T.Zone C]                               -0.1025      0.060     -1.711      0.087      -0.220       0.015
C(zone)[T.Zone D]                               -0.0230      0.020     -1.152      0.250      -0.062       0.016
C(zone)[T.Zone E]                               -0.0789      0.060     -1.306      0.192      -0.197       0.040
C(zone)[T.Zone F]                               -0.0571      0.051     -1.126      0.260      -0.156       0.042
C(zone)[T.Zone G]                               -0.0240      0.053     -0.450      0.653      -0.129       0.081
C(zone)[T.Zone H]                               -0.0546      0.058     -0.944      0.345      -0.168       0.059
C(zone)[T.Zone I]                               -0.1342      0.045     -2.976      0.003      -0.223      -0.046
C(zone)[T.Zone J]                               -0.0143      0.040     -0.360      0.719      -0.092       0.063
max_size                                      2.357e-05   1.08e-05      2.185      0.029    2.42e-06    4.47e-05
bp1                                             -0.3655      0.067     -5.439      0.000      -0.497      -0.234
bp2                                             -0.2262      0.064     -3.536      0.000      -0.352      -0.101
bp3                                             -0.0938      0.061     -1.546      0.122      -0.213       0.025
==============================================================================
Omnibus:                      219.463   Durbin-Watson:                   1.694
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1268.729
Skew:                          -0.120   Prob(JB):                    3.16e-276
Kurtosis:                       6.400   Cond. No.                     5.37e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.37e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Let's try cutting down number of categorical features and see if the adjusted R-squared values change or not. Construction status contributed least variance over prices. Drop it. Drop zone as well as clustered micromarkets are defined by zones as we have seen earlier.

In [66]:
mod1 = smf.ols(formula='log_wap ~ max_size + C(mm_cluster) + bp1 + bp2 + bp3 + C(launch_year) + C(unit_type) + C(proj_bedrooms_clean)', data=df1).fit()
print(mod1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                log_wap   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.922
Method:                 Least Squares   F-statistic:                     963.0
Date:                Mon, 11 Mar 2019   Prob (F-statistic):               0.00
Time:                        02:02:16   Log-Likelihood:                 295.77
No. Observations:                2621   AIC:                            -525.5
Df Residuals:                    2588   BIC:                            -331.8
Df Model:                          32                                         
Covariance Type:            nonrobust                                         
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
Intercept                                9.3404      0.071    130.785      0.000       9.200       9.480
C(mm_cluster)[T.1]                       1.0709      0.043     24.938      0.000       0.987       1.155
C(mm_cluster)[T.2]                      -0.7347      0.019    -39.316      0.000      -0.771      -0.698
C(mm_cluster)[T.3]                       0.5270      0.024     21.625      0.000       0.479       0.575
C(mm_cluster)[T.4]                       1.1371      0.088     12.885      0.000       0.964       1.310
C(mm_cluster)[T.5]                       0.9714      0.036     27.282      0.000       0.902       1.041
C(mm_cluster)[T.6]                       0.3742      0.021     18.082      0.000       0.334       0.415
C(mm_cluster)[T.7]                       1.3652      0.062     21.945      0.000       1.243       1.487
C(mm_cluster)[T.8]                       1.1758      0.038     30.750      0.000       1.101       1.251
C(mm_cluster)[T.9]                      -1.0837      0.018    -60.326      0.000      -1.119      -1.048
C(mm_cluster)[T.10]                     -0.0998      0.052     -1.919      0.055      -0.202       0.002
C(mm_cluster)[T.11]                      0.8541      0.041     20.582      0.000       0.773       0.935
C(mm_cluster)[T.12]                     -0.2813      0.024    -11.615      0.000      -0.329      -0.234
C(mm_cluster)[T.13]                      0.6774      0.060     11.240      0.000       0.559       0.796
C(mm_cluster)[T.14]                      0.2753      0.024     11.467      0.000       0.228       0.322
C(mm_cluster)[T.15]                      1.1437      0.048     23.836      0.000       1.050       1.238
C(mm_cluster)[T.16]                      0.4711      0.022     21.856      0.000       0.429       0.513
C(mm_cluster)[T.17]                      0.6575      0.029     22.306      0.000       0.600       0.715
C(mm_cluster)[T.18]                      1.0048      0.084     11.954      0.000       0.840       1.170
C(mm_cluster)[T.19]                     -0.5518      0.019    -29.409      0.000      -0.589      -0.515
C(launch_year)[T.2015]                   0.0388      0.012      3.304      0.001       0.016       0.062
C(launch_year)[T.2016]                   0.0541      0.012      4.461      0.000       0.030       0.078
C(launch_year)[T.2017]                   0.0846      0.013      6.463      0.000       0.059       0.110
C(launch_year)[T.2018]                   0.1325      0.022      6.008      0.000       0.089       0.176
C(unit_type)[T.Villa]                   -0.0955      0.049     -1.933      0.053      -0.192       0.001
C(proj_bedrooms_clean)[T.2 and more]    -0.0045      0.016     -0.281      0.778      -0.036       0.027
C(proj_bedrooms_clean)[T.3 and more]     0.0138      0.033      0.421      0.674      -0.051       0.078
C(proj_bedrooms_clean)[T.4 and more]     0.0933      0.074      1.254      0.210      -0.053       0.239
C(proj_bedrooms_clean)[T.6 and more]    -0.0673      0.128     -0.524      0.600      -0.319       0.185
max_size                              2.781e-05   1.06e-05      2.613      0.009    6.94e-06    4.87e-05
bp1                                     -0.3344      0.067     -4.983      0.000      -0.466      -0.203
bp2                                     -0.2070      0.064     -3.232      0.001      -0.333      -0.081
bp3                                     -0.0848      0.061     -1.396      0.163      -0.204       0.034
==============================================================================
Omnibus:                      206.328   Durbin-Watson:                   1.675
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1103.151
Skew:                          -0.123   Prob(JB):                    2.84e-240
Kurtosis:                       6.169   Cond. No.                     3.70e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.7e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

We observe that no deterioration in adjusted R-square value.

In [67]:
mod = smf.ols(formula='wap ~ max_size + C(mm_cluster) + bp1 + bp2 + bp3 + C(launch_year) + C(unit_type) + C(proj_bedrooms_clean)', data=df1).fit()
print(mod.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    wap   R-squared:                       0.889
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     648.4
Date:                Mon, 11 Mar 2019   Prob (F-statistic):               0.00
Time:                        02:02:16   Log-Likelihood:                -24728.
No. Observations:                2621   AIC:                         4.952e+04
Df Residuals:                    2588   BIC:                         4.971e+04
Df Model:                          32                                         
Covariance Type:            nonrobust                                         
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
Intercept                              1.27e+04   1000.269     12.695      0.000    1.07e+04    1.47e+04
C(mm_cluster)[T.1]                    1.997e+04    601.424     33.198      0.000    1.88e+04    2.11e+04
C(mm_cluster)[T.2]                   -4889.3082    261.716    -18.682      0.000   -5402.502   -4376.115
C(mm_cluster)[T.3]                    7061.7882    341.325     20.689      0.000    6392.490    7731.086
C(mm_cluster)[T.4]                    2.644e+04   1235.973     21.392      0.000     2.4e+04    2.89e+04
C(mm_cluster)[T.5]                    1.657e+04    498.706     33.229      0.000    1.56e+04    1.75e+04
C(mm_cluster)[T.6]                    4281.7754    289.818     14.774      0.000    3713.477    4850.073
C(mm_cluster)[T.7]                    3.486e+04    871.334     40.011      0.000    3.32e+04    3.66e+04
C(mm_cluster)[T.8]                    2.523e+04    535.564     47.110      0.000    2.42e+04    2.63e+04
C(mm_cluster)[T.9]                   -6187.7689    251.603    -24.593      0.000   -6681.132   -5694.406
C(mm_cluster)[T.10]                   -752.5512    728.149     -1.034      0.301   -2180.364     675.261
C(mm_cluster)[T.11]                    1.38e+04    581.230     23.742      0.000    1.27e+04    1.49e+04
C(mm_cluster)[T.12]                  -2425.7306    339.183     -7.152      0.000   -3090.828   -1760.633
C(mm_cluster)[T.13]                   9969.6284    844.049     11.812      0.000    8314.548    1.16e+04
C(mm_cluster)[T.14]                   3064.9951    336.210      9.116      0.000    2405.728    3724.262
C(mm_cluster)[T.15]                   2.275e+04    672.027     33.859      0.000    2.14e+04    2.41e+04
C(mm_cluster)[T.16]                   5937.2352    301.916     19.665      0.000    5345.213    6529.257
C(mm_cluster)[T.17]                   9331.7887    412.866     22.602      0.000    8522.208    1.01e+04
C(mm_cluster)[T.18]                   1.809e+04   1177.194     15.363      0.000    1.58e+04    2.04e+04
C(mm_cluster)[T.19]                  -4075.2780    262.813    -15.506      0.000   -4590.623   -3559.933
C(launch_year)[T.2015]                 562.6907    164.493      3.421      0.001     240.140     885.242
C(launch_year)[T.2016]                 828.9177    169.875      4.880      0.000     495.814    1162.022
C(launch_year)[T.2017]                 897.4419    183.335      4.895      0.000     537.945    1256.939
C(launch_year)[T.2018]                1453.3307    308.777      4.707      0.000     847.855    2058.806
C(unit_type)[T.Villa]                -2667.5384    692.187     -3.854      0.000   -4024.835   -1310.242
C(proj_bedrooms_clean)[T.2 and more]    70.2305    222.157      0.316      0.752    -365.393     505.854
C(proj_bedrooms_clean)[T.3 and more]  1191.4271    460.495      2.587      0.010     288.450    2094.404
C(proj_bedrooms_clean)[T.4 and more]  3979.1253   1041.596      3.820      0.000    1936.680    6021.570
C(proj_bedrooms_clean)[T.6 and more] -1156.2855   1798.828     -0.643      0.520   -4683.573    2371.002
max_size                                 0.6849      0.149      4.595      0.000       0.393       0.977
bp1                                  -4586.5127    939.880     -4.880      0.000   -6429.505   -2743.520
bp2                                  -4103.3483    896.823     -4.575      0.000   -5861.911   -2344.785
bp3                                  -2398.1211    850.958     -2.818      0.005   -4066.748    -729.494
==============================================================================
Omnibus:                      790.794   Durbin-Watson:                   1.961
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            46056.169
Skew:                           0.585   Prob(JB):                         0.00
Kurtosis:                      23.503   Cond. No.                     3.70e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.7e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The standard errors are larger and the adjusted R-squared value is lower, suggesting greater degree of residuals.

In [68]:
def residual_log (model, dataframe, residual_log, residual):
    # Function To Compute and return residual summary statistics for a model
    predicted_lwap = model.predict()
    predicted_wap=np.exp(predicted_lwap)
    dataframe[residual_log]=dataframe['log_wap'] - predicted_lwap
    dataframe[residual]=dataframe['wap'] - predicted_wap
    return (dataframe[residual_log].describe(), print('median:  ',dataframe[residual_log].median()))
In [69]:
def percentage_error_projects (dataframe, residual, percentage_error,percentage_error_interval):
    #Calculate percentage error w.r.t wap
    dataframe[percentage_error]=(dataframe[residual]/dataframe['wap'])*100
    #Split percentage error into equal intervals
    dataframe[percentage_error_interval] = pd.cut(dataframe[percentage_error], bins=[-float("inf"),-30,-20,-10,0,10,20,30,float("inf")])
    # How many projects in each interval of wap?
    grouped_df = dataframe.groupby(percentage_error_interval)['project_number'].count().reset_index()
    # Percentage of projects under each interval
    grouped_df['percentage_projects']=(grouped_df['project_number']/grouped_df['project_number'].sum())*100
    return (grouped_df)    
In [70]:
residual_log (mod1, df1, 'residual_log','residual') # Residual statistics for log-transformed proposed base model
median:   0.004117609696402624
Out[70]:
(count    2.621000e+03
 mean     5.606598e-13
 std      2.161909e-01
 min     -1.345698e+00
 25%     -1.274426e-01
 50%      4.117610e-03
 75%      1.236302e-01
 max      1.146930e+00
 Name: residual_log, dtype: float64, None)
In [71]:
pred_lwap = mod1.predict() #Predicted values of log-transformed wap
pred_wap=np.exp(pred_lwap) #Predicted log-transformed wap values transformed into absolute values

#Include values for predicted values of wap and log-wap derived from base model with log-transformation into dataframe
pred_lwap=pd.DataFrame(pred_lwap)
pred_wap=pd.DataFrame(pred_wap)
pred_lwap.columns=['pred_lwap']
pred_wap.columns=['pred_wap']

df1=pd.concat([df1,pred_lwap,pred_wap],axis=1)
df1.head()
Out[71]:
zone micro_market_number developer_number project_number wap log_wap proj_bedrooms_clean unit_type proj_launched_units launch_year ... bp6 bp7 bp8 b48 mm_cluster wap_interval residual_log residual pred_lwap pred_wap
0 Zone A 1 1 1 14464.0000 9.579418 1 and more Apartment 111 2015 ... 0.0 0.0 0.0 0 16 (10000.0, 30000.0] 0.037011 525.544023 9.542407 13938.455977
1 Zone A 1 2 2 13982.0000 9.545526 2 and more Apartment 150 2014 ... 0.0 0.0 0.0 0 16 (10000.0, 30000.0] -0.124804 -1858.581942 9.670330 15840.581942
2 Zone A 1 2 3 13982.0000 9.545526 2 and more Apartment 150 2014 ... 0.0 0.0 0.0 0 16 (10000.0, 30000.0] -0.124804 -1858.581942 9.670330 15840.581942
3 Zone A 1 5 6 12857.0000 9.461644 1 and more Apartment 120 2016 ... 0.0 0.0 0.0 0 16 (10000.0, 30000.0] -0.099842 -1349.933397 9.561485 14206.933397
4 Zone A 1 7 8 16981.5758 9.739884 1 and more Apartment 33 2015 ... 0.0 0.0 0.0 0 16 (10000.0, 30000.0] 0.015628 263.318500 9.724257 16718.257300

5 rows × 36 columns

In [72]:
percentage_error_projects (df1, 'residual', 'percentage_error','percentage_error_interval')
Out[72]:
percentage_error_interval project_number percentage_projects
0 (-inf, -30.0] 231 8.813430
1 (-30.0, -20.0] 196 7.478062
2 (-20.0, -10.0] 378 14.421976
3 (-10.0, 0.0] 482 18.389928
4 (0.0, 10.0] 593 22.624952
5 (10.0, 20.0] 439 16.749332
6 (20.0, 30.0] 186 7.096528
7 (30.0, inf] 116 4.425792
In [73]:
#Mean Absolute Percentage Error of log-transformed proposed base model
np.absolute(df1['percentage_error']).mean()
Out[73]:
16.370813489410583
In [74]:
# Lets investigate normal distribution properties of log-transformed residual and absolute residuals of base model
f, ax = plt.subplots(1,2,figsize=(5,5))
sns.distplot(df1['residual'],color="y", ax=ax[0])
sns.distplot(df1['residual_log'],color="y",ax=ax[1])
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5aff24470>
In [75]:
print("Skewness of residual: %f" % df1['residual'].skew())
print("Kurtosis of residual: %f" % df1['residual'].kurt())
print("Skewness of residual_log: %f" % df1['residual_log'].skew())
print("Kurtosis of residual_log: %f" % df1['residual_log'].kurt())
Skewness of residual: 1.121418
Kurtosis of residual: 17.583522
Skewness of residual_log: -0.122579
Kurtosis of residual_log: 3.177149

Log-transformation reduces skewness and kurtosis of error distribution.

Let's look at the residual distribution of the first proposed base model with all the categorical features.

In [76]:
residual_log (model, df1, 'residual_log_model','residual_model') # Residual statistics for the first proposed base model
median:   0.0005739447185177937
Out[76]:
(count    2.621000e+03
 mean     1.778082e-12
 std      2.149221e-01
 min     -1.377546e+00
 25%     -1.227572e-01
 50%      5.739447e-04
 75%      1.205935e-01
 max      1.128255e+00
 Name: residual_log_model, dtype: float64, None)
In [77]:
percentage_error_projects (df1, 'residual_model', 'percentage_error_model','percentage_error_interval_model')
Out[77]:
percentage_error_interval_model project_number percentage_projects
0 (-inf, -30.0] 221 8.431896
1 (-30.0, -20.0] 188 7.172835
2 (-20.0, -10.0] 398 15.185044
3 (-10.0, 0.0] 502 19.152995
4 (0.0, 10.0] 590 22.510492
5 (10.0, 20.0] 425 16.215185
6 (20.0, 30.0] 180 6.867608
7 (30.0, inf] 117 4.463945
In [78]:
#Mean Absolute Percentage Error of first proposed base model
np.absolute(df1['percentage_error_model']).mean()
Out[78]:
16.142467316800943
In [79]:
sns.FacetGrid(df1, col="wap_interval").map(plt.hist, "residual_model")
Out[79]:
<seaborn.axisgrid.FacetGrid at 0x1a5affdbb38>

Smaller instances of prediction errors at higher prices.

In [80]:
import scipy.stats as stats

fig,ax=plt.subplots(figsize=(10,10))
#Cook's Distance Plot for log transformed base model
fig=sm.graphics.influence_plot(mod, ax=ax, color='b')
#QQ Plot for absolute residual of log-transformed base model showing deviation from normal distribution
fig = sm.qqplot(df1['residual'], fit=True, line='45')
plt.show()

Cook's distance identifies outliers and leverage points in observations that shows their influence on regression model. QQ normal plot of a quantity signifies how well it fits a normal distribution.

In [81]:
fig,ax=plt.subplots(figsize=(10,10))
#Cook's Distance Plot for base model
fig=sm.graphics.influence_plot(mod1, ax=ax, color='b')
#QQ Plot for log-residual of base model showing deviation from normal distribution
fig = sm.qqplot(df1['residual_log'], fit=True, line='45')
plt.show()

Log-transformation of a skewed quantity will definitely aid in greater degree of normalisation. The residuals of the model with log-transformation shows greater resemblance with a normal distribution.

In [82]:
fig,ax=plt.subplots(figsize=(10,10))
#Cook's Distance Plot for first proposed base model
fig=sm.graphics.influence_plot(model, ax=ax, color='b')
#QQ Plot for log-residual of first proposed base model 
fig = sm.qqplot(df1['residual_log_model'], fit=True, line='45')
plt.show()

Therefore, the model 'mod1' will be the base model due to greater degree of normality of its residuals as indicated by qqplot, skewness and kurtosis values in distribution table.

In [83]:
# Residual plots involving continuous variables in based model
f, ax = plt.subplots(2,2 ,figsize=(25,10))
sns.residplot(x="max_size", y='residual_log', data=df1, color="r",ax=ax[0,0])
sns.residplot(x="bp1", y='residual_log', data=df1, color="black",ax=ax[0,1])
sns.residplot(x="bp2", y='residual_log', data=df1, color="y",ax=ax[1,0])
sns.residplot(x="bp3", y='residual_log', data=df1, color="green",ax=ax[1,1])
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5b2b2a4a8>

3.2 Base Model Cross Validation

In [84]:
include=['log_wap','max_size','mm_cluster', 'bp1','bp2','bp3', 'unit_type','proj_bedrooms_clean','launch_year']
df1_cv=pd.DataFrame(df1[include])
type_converter (df1_cv, 'mm_cluster', 'category')

df1_cv.head()
Out[84]:
log_wap max_size mm_cluster bp1 bp2 bp3 unit_type proj_bedrooms_clean launch_year
0 9.579418 744.75 16 0.954955 0.045045 0.000000 Apartment 1 and more 2015
1 9.545526 1003.30 16 0.000000 0.653333 0.346667 Apartment 2 and more 2014
2 9.545526 1003.30 16 0.000000 0.653333 0.346667 Apartment 2 and more 2014
3 9.461644 590.51 16 0.891667 0.108333 0.000000 Apartment 1 and more 2016
4 9.739884 1717.00 16 0.030303 0.666667 0.303030 Apartment 1 and more 2015
In [85]:
from sklearn.preprocessing import OneHotEncoder
#create an instance of one-hot encoding class
onehot_encoder = OneHotEncoder(sparse=False)
colmn=['mm_cluster','unit_type','proj_bedrooms_clean','launch_year']
onehot_encoded = onehot_encoder.fit_transform(df1_cv[colmn])
#Create and concatenate new features formed from one-hot encoding of categorical features
onehot_encoded=pd.DataFrame(onehot_encoded)
df1_cv=pd.concat([df1_cv,onehot_encoded],axis=1)
df1_cv=df1_cv.drop(colmn,axis=1)
df1_cv.head()
Out[85]:
log_wap max_size bp1 bp2 bp3 0 1 2 3 4 ... 22 23 24 25 26 27 28 29 30 31
0 9.579418 744.75 0.954955 0.045045 0.000000 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1 9.545526 1003.30 0.000000 0.653333 0.346667 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2 9.545526 1003.30 0.000000 0.653333 0.346667 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 9.461644 590.51 0.891667 0.108333 0.000000 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 9.739884 1717.00 0.030303 0.666667 0.303030 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

5 rows × 37 columns

In [86]:
df1_cv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2621 entries, 0 to 2620
Data columns (total 37 columns):
log_wap     2621 non-null float64
max_size    2621 non-null float64
bp1         2621 non-null float64
bp2         2621 non-null float64
bp3         2621 non-null float64
0           2621 non-null float64
1           2621 non-null float64
2           2621 non-null float64
3           2621 non-null float64
4           2621 non-null float64
5           2621 non-null float64
6           2621 non-null float64
7           2621 non-null float64
8           2621 non-null float64
9           2621 non-null float64
10          2621 non-null float64
11          2621 non-null float64
12          2621 non-null float64
13          2621 non-null float64
14          2621 non-null float64
15          2621 non-null float64
16          2621 non-null float64
17          2621 non-null float64
18          2621 non-null float64
19          2621 non-null float64
20          2621 non-null float64
21          2621 non-null float64
22          2621 non-null float64
23          2621 non-null float64
24          2621 non-null float64
25          2621 non-null float64
26          2621 non-null float64
27          2621 non-null float64
28          2621 non-null float64
29          2621 non-null float64
30          2621 non-null float64
31          2621 non-null float64
dtypes: float64(37)
memory usage: 757.7 KB
In [87]:
from sklearn.model_selection import train_test_split

target = 'log_wap'
features = df1_cv.columns != target
x = df1_cv.loc[:,features]
y = df1_cv[target]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=8)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
(1965, 36) (656, 36) (1965,) (656,)
In [88]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(x_train,y_train)
# cross validate
scores_reg = cross_validate(reg, x_train, y_train, cv=8, return_train_score=True,return_estimator=False)
scores_reg =pd.DataFrame(scores_reg)
scores_reg 
Out[88]:
fit_time score_time test_score train_score
0 0.004001 0.004000 0.909530 0.925739
1 0.004000 0.004000 0.923648 0.923895
2 0.004002 0.000000 0.933639 0.922384
3 0.000000 0.003999 0.915586 0.924915
4 0.004004 0.000000 0.936407 0.922006
5 0.004001 0.000000 0.917104 0.924721
6 0.003999 0.004000 0.911804 0.925412
7 0.007999 0.000000 0.916359 0.924815
In [89]:
print("Linear Regression Base Model Train Score:",scores_reg ['train_score'].mean())
print("Linear Regression Base Model Test Score:",scores_reg ['test_score'].mean())
Linear Regression Base Model Train Score: 0.924235767164227
Linear Regression Base Model Test Score: 0.9205098508699949

4. Cluster Analysis of Amenities

In [90]:
amenities=df.loc[:,'am_water':'am_water_supply'] # yes-counts of individual amenities
amenities.count().sort_values(ascending=False)
Out[90]:
am_closed_carpark     1441
am_lift               1281
am_oth                1230
am_power_backup       1125
am_carpark            1112
am_ls_tree            1082
am_security           1062
am_ls_garden          1059
am_ff_req             1031
am_elec_meter_room    1007
am_water_cons         1003
am_storm_drain         990
am_open_carpark        928
am_jog                 834
am_vaastu              829
am_open_parking        827
am_streetlight         781
am_open_space          775
am_sold_waste          759
am_rain_harvest        758
am_intercom            757
am_energy_mgmt         756
am_maint_staff         725
am_swr_trt_slug        716
am_club                674
am_indoor_games        649
am_commu_building      644
am_swr_chamber         576
am_child_play          546
am_school              532
                      ... 
am_cctv                206
am_ff_system           205
am_swer_trt_plt        124
am_water               112
am_spa                 103
am_amphi               101
am_tennis               97
am_badmintorn           95
am_lib                  93
am_basketball           89
am_internet             82
am_squash               81
am_temple               75
am_roads                67
am_skating              62
am_rec_facility         60
am_banquet              53
am_cric_pitch           47
am_party_hall           45
am_commu_hall           39
am_health               35
am_earthquake           29
am_aerobics             21
am_service_lift         19
am_business_center      17
am_gated                13
am_transformer           8
am_utility_shop          8
am_ht_alarm              6
am_security_guards       3
Length: 72, dtype: int64
In [91]:
amenities=amenities.fillna("No")
amenities.head()
Out[91]:
am_water am_security am_aerobics am_open_space am_amphi am_atm am_badmintorn am_banquet am_basketball am_business_center ... am_storm_drain am_streetlight am_swim am_temple am_tennis am_swr_trt_slug am_utility_shop am_vaastu am_water_cons am_water_supply
0 No No No No No No No No No No ... No No No No No No No No No No
1 No Yes No No No No No No No No ... No No No No No No No No No No
2 No No No No No No No No No No ... No No No No No No No No No No
3 No No No Yes No No No No No No ... Yes Yes No No No Yes No No Yes No
4 No Yes No No No Yes No No No No ... No No No No No No No Yes No No

5 rows × 72 columns

In [92]:
def label_encoder(dataframe, col_name): # Function for label encoding categorical variables
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    le.fit(dataframe[col_name].unique())
    dataframe[col_name] = le.transform(dataframe[col_name])
In [93]:
for i in list(amenities.columns): # Label-encoding of amenities features
    label_encoder(amenities, i)
In [94]:
amenities['am_water']=amenities['am_water']+amenities['am_water_cons']+amenities['am_water_supply']+amenities['am_rain_harvest']
amenities['am_security']=amenities['am_security']+amenities['am_security_guards']+amenities['am_intercom']+amenities['am_cctv']
amenities['am_outdoor_game']=amenities['am_badmintorn']+amenities['am_basketball']+amenities['am_cric_pitch']+amenities['am_golf']+amenities['am_sports']+amenities['am_tennis']+amenities['am_swim']+amenities['am_jog']
amenities['am_indoor_game']=amenities['am_aerobics']+amenities['am_gym']+amenities['am_skating']+amenities['am_indoor_games']+amenities['am_squash']+amenities['am_spa']
amenities['am_sewage']=amenities['am_swer_trt_plt']+amenities['am_swr_chamber']+amenities['am_sold_waste']+amenities['am_storm_drain']+amenities['am_swr_trt_slug']
amenities['am_roads']=amenities['am_roads']+amenities['am_road_footpath']+amenities['am_streetlight']
amenities['am_cyber']=amenities['am_internet']
amenities['am_power']=amenities['am_energy_mgmt']+amenities['am_elec_meter_room']+amenities['am_ht_alarm']+amenities['am_transformer']+amenities['am_earthquake']+amenities['am_power_backup']
amenities['am_religion']=amenities['am_temple']+amenities['am_vaastu']
amenities['am_space']=amenities['am_ls_garden']+amenities['am_ls_tree']+amenities['am_child_play']+amenities['am_open_space']+amenities['am_amphi']+amenities['am_rec_facility']
amenities['am_finance']=amenities['am_atm']+amenities['am_business_center']
amenities['am_firefighting']=amenities['am_ff_system']+amenities['am_ff_req']
amenities['am_health_related']=amenities['am_health']+amenities['am_hosp']
amenities['am_car_parking']=amenities['am_open_carpark']+amenities['am_open_parking']+amenities['am_closed_carpark']+amenities['am_carpark']
amenities['am_education']=amenities['am_lib']+amenities['am_oth']+amenities['am_school']
amenities['am_maintenance']=amenities['am_maint_staff']+amenities['am_gated']+amenities['am_staffq']
amenities['am_social_gatherings']=amenities['am_commu_building']+amenities['am_commu_hall']+amenities['am_banquet']+amenities['am_party_hall']+amenities['am_club']+amenities['am_multi_room']+amenities['am_cafeteria']
amenities['am_lift']=amenities['am_lift']+amenities['am_service_lift']
amenities['am_shop']=amenities['am_shopmall']+amenities['am_utility_shop']
In [95]:
for i in ['am_water','am_security','am_outdoor_game','am_indoor_game','am_sewage','am_roads','am_cyber','am_power','am_religion','am_space','am_finance','am_firefighting','am_health_related','am_car_parking','am_education','am_maintenance','am_social_gatherings','am_lift','am_shop']:
    amenities[i]=amenities[i].replace([2,3,4,5,6,7],[1,1,1,1,1,1]) # Using OR logic, replace counts of amenities in each 
    # cluster greater than one by 1 to signify if any amenity is present or not
    
In [96]:
for i in list(amenities.columns): # Convert amenities cluster features from integer to categorical
    type_converter (amenities, i, 'category')
In [97]:
df1=pd.concat([df1,amenities],axis=1)
df1.head()
Out[97]:
zone micro_market_number developer_number project_number wap log_wap proj_bedrooms_clean unit_type proj_launched_units launch_year ... am_religion am_space am_finance am_firefighting am_health_related am_car_parking am_education am_maintenance am_social_gatherings am_shop
0 Zone A 1 1 1 14464.0000 9.579418 1 and more Apartment 111 2015 ... 0 0 0 0 0 1 0 0 0 0
1 Zone A 1 2 2 13982.0000 9.545526 2 and more Apartment 150 2014 ... 0 1 0 0 0 1 1 0 1 0
2 Zone A 1 2 3 13982.0000 9.545526 2 and more Apartment 150 2014 ... 0 1 0 0 0 1 1 0 1 0
3 Zone A 1 5 6 12857.0000 9.461644 1 and more Apartment 120 2016 ... 0 1 0 1 0 1 0 0 1 0
4 Zone A 1 7 8 16981.5758 9.739884 1 and more Apartment 33 2015 ... 1 1 1 1 0 1 1 1 1 0

5 rows × 129 columns

5. Analysis of Specifications

In [98]:
specifications=df.loc[:,'sp_balcony':'sp_cement'] # Yes-counts of specifications
specifications.count().sort_values(ascending=False)
Out[98]:
sp_kitchen     1163
sp_toilets     1140
sp_mas_bed     1132
sp_oth_bed     1045
sp_windows      875
sp_main         726
sp_interior     723
sp_exterior     645
sp_internal     621
sp_balcony      597
sp_livdin       372
sp_points       356
sp_wiring       331
sp_switches     165
sp_lobby         58
sp_frame         22
sp_cement         6
dtype: int64
In [99]:
specifications=specifications.fillna('No')
specifications.head()
Out[99]:
sp_balcony sp_kitchen sp_livdin sp_mas_bed sp_oth_bed sp_toilets sp_internal sp_main sp_interior sp_switches sp_windows sp_wiring sp_exterior sp_frame sp_points sp_lobby sp_cement
0 No No No No No No No No No No No No No No No No No
1 No Vitrified Tiles No Laminated Wooden Vitrified Tiles Glazed Tiles Dado up to 7 Feet Height Above Pl... Veneer Finish Flush Shutter Teak Veneered / Laminated Flush Door No No UPVC Windows with Granite Sills No No No No No No
2 No Vitrified Tiles No Laminated Wooden Vitrified Tiles Concealed Plumbing,Glazed Tiles Dado up to 7 F... Veneer Finish Flush Shutter Teak Veneered / Laminated Flush Door No No UPVC Windows with Granite Sills No No No No No No
3 No No No No No No No No No No No No No No No No No
4 Vitrified Tiles Moduler Kitchen with Refrigerator,Vitrified Tiles No Wooden Flooring Vitrified Tiles Branded Sanitary Fittings,Ceramic Tiles Dado u... Flush Shutters Teak Wood Frame No No French Windows No No No No No No
In [100]:
specifications.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2621 entries, 0 to 2620
Data columns (total 17 columns):
sp_balcony     2621 non-null object
sp_kitchen     2621 non-null object
sp_livdin      2621 non-null object
sp_mas_bed     2621 non-null object
sp_oth_bed     2621 non-null object
sp_toilets     2621 non-null object
sp_internal    2621 non-null object
sp_main        2621 non-null object
sp_interior    2621 non-null object
sp_switches    2621 non-null object
sp_windows     2621 non-null object
sp_wiring      2621 non-null object
sp_exterior    2621 non-null object
sp_frame       2621 non-null object
sp_points      2621 non-null object
sp_lobby       2621 non-null object
sp_cement      2621 non-null object
dtypes: object(17)
memory usage: 348.2+ KB
In [101]:
for i in list(specifications.columns):
    specifications[i]=specifications[i].str.contains('No')
In [102]:
for i in list(specifications.columns):
    label_encoder(specifications, i)
    specifications[i] = specifications[i].replace([0,1],[1,0])
In [103]:
for i in list(specifications.columns):
    type_converter (specifications, i, 'category')
In [104]:
df1=pd.concat([df1,specifications],axis=1)
df1.head()
Out[104]:
zone micro_market_number developer_number project_number wap log_wap proj_bedrooms_clean unit_type proj_launched_units launch_year ... sp_main sp_interior sp_switches sp_windows sp_wiring sp_exterior sp_frame sp_points sp_lobby sp_cement
0 Zone A 1 1 1 14464.0000 9.579418 1 and more Apartment 111 2015 ... 0 0 0 0 0 0 0 0 0 0
1 Zone A 1 2 2 13982.0000 9.545526 2 and more Apartment 150 2014 ... 1 0 0 1 0 0 0 0 0 0
2 Zone A 1 2 3 13982.0000 9.545526 2 and more Apartment 150 2014 ... 1 0 0 1 0 0 0 0 0 0
3 Zone A 1 5 6 12857.0000 9.461644 1 and more Apartment 120 2016 ... 0 0 0 0 0 0 0 0 0 0
4 Zone A 1 7 8 16981.5758 9.739884 1 and more Apartment 33 2015 ... 1 0 0 1 0 0 0 0 0 0

5 rows × 146 columns

6. Including in Amenities Clusters & Specifications inside Model

Let's add all the amenities clusters and specifications features to base model and evaluate performance.

In [105]:
mod2 = smf.ols(formula='log_wap ~ max_size + C(mm_cluster) + bp1 + bp2 + bp3 + C(launch_year) + C(proj_bedrooms_clean) + C(unit_type) + C(am_water)+ C(am_security)+ C(am_indoor_game) + C(am_outdoor_game)+ C(am_sewage) + C(am_roads) + C(am_cyber) + C(am_power) + C(am_religion) + C(am_space) + C(am_finance) + C(am_firefighting) + C(am_health_related) + C(am_car_parking) + C(am_education) + C(am_maintenance) + C(am_social_gatherings) + C(am_lift) + C(am_shop) + C(sp_balcony) + C(sp_kitchen) + C(sp_livdin) + C(sp_mas_bed) + C(sp_oth_bed) + C(sp_toilets) + C(sp_internal) + C(sp_main) + C(sp_interior) + C(sp_switches) + C(sp_windows) + C(sp_wiring) + C(sp_exterior) + C(sp_frame) + C(sp_points) + C(sp_lobby) + C(sp_cement)', data=df1).fit()
print(mod2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                log_wap   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.924
Method:                 Least Squares   F-statistic:                     468.2
Date:                Mon, 11 Mar 2019   Prob (F-statistic):               0.00
Time:                        02:03:00   Log-Likelihood:                 352.15
No. Observations:                2621   AIC:                            -566.3
Df Residuals:                    2552   BIC:                            -161.2
Df Model:                          68                                         
Covariance Type:            nonrobust                                         
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
Intercept                                9.3679      0.072    130.220      0.000       9.227       9.509
C(mm_cluster)[T.1]                       1.0643      0.043     24.790      0.000       0.980       1.148
C(mm_cluster)[T.2]                      -0.7322      0.019    -39.020      0.000      -0.769      -0.695
C(mm_cluster)[T.3]                       0.5173      0.024     21.185      0.000       0.469       0.565
C(mm_cluster)[T.4]                       1.1755      0.088     13.356      0.000       1.003       1.348
C(mm_cluster)[T.5]                       0.9702      0.036     27.125      0.000       0.900       1.040
C(mm_cluster)[T.6]                       0.3788      0.021     18.148      0.000       0.338       0.420
C(mm_cluster)[T.7]                       1.3603      0.062     21.915      0.000       1.239       1.482
C(mm_cluster)[T.8]                       1.1639      0.038     30.455      0.000       1.089       1.239
C(mm_cluster)[T.9]                      -1.0641      0.018    -58.180      0.000      -1.100      -1.028
C(mm_cluster)[T.10]                     -0.1053      0.052     -2.037      0.042      -0.207      -0.004
C(mm_cluster)[T.11]                      0.8624      0.041     20.859      0.000       0.781       0.943
C(mm_cluster)[T.12]                     -0.2782      0.024    -11.499      0.000      -0.326      -0.231
C(mm_cluster)[T.13]                      0.6897      0.060     11.482      0.000       0.572       0.807
C(mm_cluster)[T.14]                      0.2890      0.024     11.955      0.000       0.242       0.336
C(mm_cluster)[T.15]                      1.1359      0.048     23.815      0.000       1.042       1.229
C(mm_cluster)[T.16]                      0.4862      0.022     22.239      0.000       0.443       0.529
C(mm_cluster)[T.17]                      0.6565      0.029     22.313      0.000       0.599       0.714
C(mm_cluster)[T.18]                      1.0027      0.084     12.000      0.000       0.839       1.167
C(mm_cluster)[T.19]                     -0.5501      0.019    -28.982      0.000      -0.587      -0.513
C(launch_year)[T.2015]                   0.0371      0.012      3.086      0.002       0.014       0.061
C(launch_year)[T.2016]                   0.0575      0.013      4.524      0.000       0.033       0.082
C(launch_year)[T.2017]                   0.0941      0.014      6.747      0.000       0.067       0.121
C(launch_year)[T.2018]                   0.1468      0.023      6.348      0.000       0.101       0.192
C(proj_bedrooms_clean)[T.2 and more]    -0.0042      0.016     -0.264      0.792      -0.035       0.027
C(proj_bedrooms_clean)[T.3 and more]     0.0089      0.033      0.273      0.785      -0.055       0.073
C(proj_bedrooms_clean)[T.4 and more]     0.0738      0.074      0.997      0.319      -0.071       0.219
C(proj_bedrooms_clean)[T.6 and more]    -0.0835      0.128     -0.654      0.513      -0.334       0.167
C(unit_type)[T.Villa]                   -0.0818      0.049     -1.658      0.098      -0.178       0.015
C(am_water)[T.1]                         0.0134      0.014      0.967      0.334      -0.014       0.041
C(am_security)[T.1]                      0.0123      0.012      1.026      0.305      -0.011       0.036
C(am_indoor_game)[T.1]                  -0.0028      0.013     -0.216      0.829      -0.028       0.023
C(am_outdoor_game)[T.1]                 -0.0242      0.014     -1.769      0.077      -0.051       0.003
C(am_sewage)[T.1]                       -0.0342      0.019     -1.834      0.067      -0.071       0.002
C(am_roads)[T.1]                        -0.0230      0.016     -1.450      0.147      -0.054       0.008
C(am_cyber)[T.1]                        -0.0862      0.026     -3.298      0.001      -0.137      -0.035
C(am_power)[T.1]                        -0.0074      0.014     -0.549      0.583      -0.034       0.019
C(am_religion)[T.1]                      0.0091      0.014      0.638      0.523      -0.019       0.037
C(am_space)[T.1]                         0.0071      0.015      0.457      0.648      -0.023       0.037
C(am_finance)[T.1]                      -0.0206      0.024     -0.870      0.384      -0.067       0.026
C(am_firefighting)[T.1]                  0.0625      0.015      4.218      0.000       0.033       0.092
C(am_health_related)[T.1]               -0.0051      0.026     -0.198      0.843      -0.056       0.045
C(am_car_parking)[T.1]                   0.0022      0.015      0.149      0.882      -0.027       0.031
C(am_education)[T.1]                     0.0300      0.013      2.365      0.018       0.005       0.055
C(am_maintenance)[T.1]                   0.0067      0.015      0.440      0.660      -0.023       0.036
C(am_social_gatherings)[T.1]            -0.0374      0.012     -2.993      0.003      -0.062      -0.013
C(am_lift)[T.1]                          0.0184      0.012      1.494      0.135      -0.006       0.043
C(am_shop)[T.1]                          0.0274      0.022      1.262      0.207      -0.015       0.070
C(sp_balcony)[T.1]                      -0.0111      0.013     -0.834      0.404      -0.037       0.015
C(sp_kitchen)[T.1]                       0.0051      0.027      0.192      0.848      -0.047       0.058
C(sp_livdin)[T.1]                        0.0179      0.015      1.185      0.236      -0.012       0.048
C(sp_mas_bed)[T.1]                      -0.0433      0.030     -1.432      0.152      -0.103       0.016
C(sp_oth_bed)[T.1]                       0.0416      0.024      1.714      0.087      -0.006       0.089
C(sp_toilets)[T.1]                       0.0114      0.024      0.469      0.639      -0.036       0.059
C(sp_internal)[T.1]                     -0.0126      0.015     -0.819      0.413      -0.043       0.018
C(sp_main)[T.1]                          0.0074      0.016      0.467      0.641      -0.024       0.039
C(sp_interior)[T.1]                      0.0157      0.015      1.044      0.297      -0.014       0.045
C(sp_switches)[T.1]                     -0.0065      0.021     -0.314      0.753      -0.047       0.034
C(sp_windows)[T.1]                      -0.0239      0.015     -1.553      0.121      -0.054       0.006
C(sp_wiring)[T.1]                       -0.0118      0.016     -0.720      0.471      -0.044       0.020
C(sp_exterior)[T.1]                      0.0006      0.015      0.041      0.967      -0.028       0.030
C(sp_frame)[T.1]                        -0.1224      0.055     -2.225      0.026      -0.230      -0.015
C(sp_points)[T.1]                       -0.0008      0.015     -0.058      0.954      -0.029       0.028
C(sp_lobby)[T.1]                        -0.0014      0.032     -0.044      0.965      -0.063       0.061
C(sp_cement)[T.1]                        0.4781      0.111      4.300      0.000       0.260       0.696
max_size                              1.373e-05    1.1e-05      1.249      0.212   -7.83e-06    3.53e-05
bp1                                     -0.3791      0.067     -5.633      0.000      -0.511      -0.247
bp2                                     -0.2456      0.064     -3.835      0.000      -0.371      -0.120
bp3                                     -0.1069      0.061     -1.767      0.077      -0.226       0.012
==============================================================================
Omnibus:                      209.676   Durbin-Watson:                   1.727
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1155.513
Skew:                          -0.111   Prob(JB):                    1.21e-251
Kurtosis:                       6.245   Cond. No.                     3.76e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.76e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [106]:
residual_log (mod2, df1, 'residual_log_mod2', 'residual_mod2')
median:   0.0032167519683774515
Out[106]:
(count    2.621000e+03
 mean     1.151898e-11
 std      2.115904e-01
 min     -1.332673e+00
 25%     -1.255561e-01
 50%      3.216752e-03
 75%      1.154965e-01
 max      1.103403e+00
 Name: residual_log_mod2, dtype: float64, None)
In [107]:
percentage_error_projects (df1, 'residual_mod2','percentage_error_mod2', 'percentage_error_interval_mod2')
Out[107]:
percentage_error_interval_mod2 project_number percentage_projects
0 (-inf, -30.0] 209 7.974056
1 (-30.0, -20.0] 213 8.126669
2 (-20.0, -10.0] 366 13.964136
3 (-10.0, 0.0] 503 19.191148
4 (0.0, 10.0] 633 24.151087
5 (10.0, 20.0] 397 15.146890
6 (20.0, 30.0] 188 7.172835
7 (30.0, inf] 112 4.273178
In [108]:
# Mean Absolute Percentage Error for model including amenities specifications and amenities
np.absolute(df1['percentage_error_mod2']).mean()
Out[108]:
15.945578609580595
In [109]:
fig,ax=plt.subplots(figsize=(10,10))
#Cook's Distance Plot for model including specifications & amenities
fig=sm.graphics.influence_plot(mod2, ax=ax, color='b')
#QQ Plot for absolute residual of model including specifications & amenities
fig = sm.qqplot(df1['residual_log_mod2'], fit=True, line='45')
plt.show()
In [110]:
g = sns.FacetGrid(df1, col="wap_interval")
g = g.map(plt.hist, "residual_mod2").set(ylim=(0, 600))

7. Dependence Between Base Model Residual with Amenities & Specifications

7.1 Chi-Square Test Between Base Model Residual & Amenities & Specifications

A two way chi-square test

In [111]:
df1['wap_level']=0
df1['residual_abs_level']=df1['residual_abs']=np.absolute(df1['residual'])

df1.loc[df1['wap'] > 20000, 'wap_level']= "High" #Splitting wap into labelled categories
df1.loc[(df1['wap'] >= 4000) & (df1['wap'] <= 20000),'wap_level']= "Medium"
df1.loc[df1['wap'] < 4000, 'wap_level']= "Low"

#Splitting absolute residual values of base model into categories
df1.loc[df1['residual_abs'] > 2500, 'residual_abs_level']= "High" 
df1.loc[(df1['residual_abs'] >= 300) & (df1['residual_abs'] <= 2500),'residual_abs_level']= "Medium"
df1.loc[(df1['residual_abs'] < 300), 'residual_abs_level']= "Low"

Most projects fall under medium category for both prices and residual level.

In [112]:
df1['residual_abs_level'].value_counts()
Out[112]:
Medium    1462
Low        607
High       552
Name: residual_abs_level, dtype: int64
In [113]:
df1['wap_level'].value_counts()
Out[113]:
Medium    1648
Low        679
High       294
Name: wap_level, dtype: int64

Chi-square test involving amenities clusters and absolute residual level.

In [114]:
print("-----Chi-Square Two-Way P-Values of amenities clusters vs residual of base model-----")
for i in ['am_water','am_security','am_outdoor_game','am_indoor_game','am_sewage','am_roads','am_cyber','am_power','am_religion','am_space','am_finance','am_firefighting','am_health_related','am_car_parking','am_education','am_maintenance','am_social_gatherings','am_lift','am_shop']:
    cont=pd.crosstab(df1['residual_abs_level'],df1[i]) #Contingency table
    g, p, dof, expctd = chi2_contingency(cont)
    print(i,": ", p)
-----Chi-Square Two-Way P-Values of amenities clusters vs residual of base model-----
am_water :  0.047398451441896305
am_security :  0.005324528404578851
am_outdoor_game :  0.504947422435368
am_indoor_game :  0.09644475210655401
am_sewage :  0.4682319790791365
am_roads :  0.06152782070986555
am_cyber :  2.8095793559777583e-05
am_power :  0.12236299711527024
am_religion :  0.5800822102273938
am_space :  0.47951077721353375
am_finance :  0.00498850205391336
am_firefighting :  0.28496938498528107
am_health_related :  0.003161331138010132
am_car_parking :  0.2310388181554251
am_education :  0.0029196924114436285
am_maintenance :  0.1846776283133462
am_social_gatherings :  0.7634971059945903
am_lift :  0.8457372366394984
am_shop :  0.02038505969475756

Chose top 6 clusters with lowest p-values: am_cyber, am_education, am_finance, am_health_related, am_indoor_game, am_security

Now, do a similar chi-square test for specifications.

In [115]:
print("-----Chi-Square Two-Way P-Values of specifications vs residual of base model-----")
for i in list(specifications.columns):
    cont=pd.crosstab(df1['residual_abs_level'],df1[i])
    g, p, dof, expctd = chi2_contingency(cont) # Continency table
    print(i,": ", p)
-----Chi-Square Two-Way P-Values of specifications vs residual of base model-----
sp_balcony :  0.2017050047333319
sp_kitchen :  0.5097959916494637
sp_livdin :  0.02026124920585448
sp_mas_bed :  0.23898619759682838
sp_oth_bed :  0.5549126407103944
sp_toilets :  0.21587842875536875
sp_internal :  0.0728487444974084
sp_main :  0.02661453354002821
sp_interior :  0.07252279049237388
sp_switches :  0.038998777848094364
sp_windows :  0.0036762596844299627
sp_wiring :  0.033593176565877854
sp_exterior :  0.00044941132658327755
sp_frame :  0.01816077583452707
sp_points :  0.008797213119887046
sp_lobby :  0.19499852879525922
sp_cement :  1.2745078421583801e-05

At 1% significance level, the following specifications are dependent on base model residuals:

sp_interior, sp_main, sp_exterior, sp_points, sp_cement

Let's reduce the number of amenities and specifications to selected few to the base model.

In [116]:
selected = smf.ols(formula='log_wap ~ max_size + C(mm_cluster) + C(proj_bedrooms_clean) + bp1 + bp2 + bp3 + C(launch_year) + C(unit_type) + C(am_cyber) + C(am_education) + C(am_finance) + C(am_security) + C(am_health_related) + C(am_indoor_game) + C(sp_interior) + C(sp_main) + C(sp_exterior) + C(sp_points) + C(sp_cement)', data=df1).fit()
print(selected.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                log_wap   R-squared:                       0.924
Model:                            OLS   Adj. R-squared:                  0.923
Method:                 Least Squares   F-statistic:                     729.6
Date:                Mon, 11 Mar 2019   Prob (F-statistic):               0.00
Time:                        02:03:34   Log-Likelihood:                 322.58
No. Observations:                2621   AIC:                            -557.2
Df Residuals:                    2577   BIC:                            -298.8
Df Model:                          43                                         
Covariance Type:            nonrobust                                         
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
Intercept                                9.3347      0.071    131.283      0.000       9.195       9.474
C(mm_cluster)[T.1]                       1.0732      0.043     25.018      0.000       0.989       1.157
C(mm_cluster)[T.2]                      -0.7282      0.019    -38.759      0.000      -0.765      -0.691
C(mm_cluster)[T.3]                       0.5240      0.024     21.516      0.000       0.476       0.572
C(mm_cluster)[T.4]                       1.1554      0.088     13.104      0.000       0.983       1.328
C(mm_cluster)[T.5]                       0.9704      0.036     27.139      0.000       0.900       1.041
C(mm_cluster)[T.6]                       0.3786      0.021     18.221      0.000       0.338       0.419
C(mm_cluster)[T.7]                       1.3538      0.062     21.856      0.000       1.232       1.475
C(mm_cluster)[T.8]                       1.1719      0.038     30.651      0.000       1.097       1.247
C(mm_cluster)[T.9]                      -1.0719      0.018    -58.921      0.000      -1.108      -1.036
C(mm_cluster)[T.10]                     -0.0916      0.052     -1.775      0.076      -0.193       0.010
C(mm_cluster)[T.11]                      0.8719      0.041     21.049      0.000       0.791       0.953
C(mm_cluster)[T.12]                     -0.2752      0.024    -11.373      0.000      -0.323      -0.228
C(mm_cluster)[T.13]                      0.6784      0.060     11.318      0.000       0.561       0.796
C(mm_cluster)[T.14]                      0.2867      0.024     11.898      0.000       0.239       0.334
C(mm_cluster)[T.15]                      1.1423      0.048     23.911      0.000       1.049       1.236
C(mm_cluster)[T.16]                      0.4858      0.022     22.360      0.000       0.443       0.528
C(mm_cluster)[T.17]                      0.6640      0.029     22.599      0.000       0.606       0.722
C(mm_cluster)[T.18]                      1.0071      0.084     12.053      0.000       0.843       1.171
C(mm_cluster)[T.19]                     -0.5466      0.019    -28.914      0.000      -0.584      -0.510
C(proj_bedrooms_clean)[T.2 and more]    -0.0031      0.016     -0.198      0.843      -0.034       0.028
C(proj_bedrooms_clean)[T.3 and more]     0.0195      0.033      0.597      0.551      -0.045       0.084
C(proj_bedrooms_clean)[T.4 and more]     0.0898      0.074      1.216      0.224      -0.055       0.235
C(proj_bedrooms_clean)[T.6 and more]    -0.0848      0.128     -0.664      0.507      -0.335       0.166
C(launch_year)[T.2015]                   0.0373      0.012      3.137      0.002       0.014       0.061
C(launch_year)[T.2016]                   0.0586      0.012      4.783      0.000       0.035       0.083
C(launch_year)[T.2017]                   0.0937      0.013      7.033      0.000       0.068       0.120
C(launch_year)[T.2018]                   0.1475      0.022      6.631      0.000       0.104       0.191
C(unit_type)[T.Villa]                   -0.1102      0.049     -2.237      0.025      -0.207      -0.014
C(am_cyber)[T.1]                        -0.0994      0.025     -3.912      0.000      -0.149      -0.050
C(am_education)[T.1]                     0.0411      0.011      3.632      0.000       0.019       0.063
C(am_finance)[T.1]                      -0.0114      0.023     -0.503      0.615      -0.056       0.033
C(am_security)[T.1]                      0.0190      0.010      1.838      0.066      -0.001       0.039
C(am_health_related)[T.1]                0.0081      0.023      0.356      0.722      -0.037       0.053
C(am_indoor_game)[T.1]                  -0.0262      0.011     -2.362      0.018      -0.048      -0.004
C(sp_interior)[T.1]                      0.0124      0.014      0.893      0.372      -0.015       0.040
C(sp_main)[T.1]                         -0.0063      0.013     -0.486      0.627      -0.032       0.019
C(sp_exterior)[T.1]                     -0.0033      0.014     -0.226      0.822      -0.032       0.025
C(sp_points)[T.1]                       -0.0046      0.014     -0.319      0.750      -0.033       0.023
C(sp_cement)[T.1]                        0.3313      0.091      3.632      0.000       0.152       0.510
max_size                              2.073e-05   1.09e-05      1.908      0.056   -5.73e-07     4.2e-05
bp1                                     -0.3545      0.067     -5.300      0.000      -0.486      -0.223
bp2                                     -0.2233      0.064     -3.503      0.000      -0.348      -0.098
bp3                                     -0.0892      0.060     -1.477      0.140      -0.208       0.029
==============================================================================
Omnibus:                      207.102   Durbin-Watson:                   1.705
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1134.343
Skew:                          -0.102   Prob(JB):                    4.79e-247
Kurtosis:                       6.216   Cond. No.                     3.71e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.71e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [117]:
residual_log (mod2, df1, 'residual_log_selected', 'residual_selected')
median:   0.0032167519683774515
Out[117]:
(count    2.621000e+03
 mean     1.151898e-11
 std      2.115904e-01
 min     -1.332673e+00
 25%     -1.255561e-01
 50%      3.216752e-03
 75%      1.154965e-01
 max      1.103403e+00
 Name: residual_log_selected, dtype: float64, None)
In [118]:
percentage_error_projects (df1, 'residual_selected','percentage_error_selected', 'percentage_error_interval_selected')
Out[118]:
percentage_error_interval_selected project_number percentage_projects
0 (-inf, -30.0] 209 7.974056
1 (-30.0, -20.0] 213 8.126669
2 (-20.0, -10.0] 366 13.964136
3 (-10.0, 0.0] 503 19.191148
4 (0.0, 10.0] 633 24.151087
5 (10.0, 20.0] 397 15.146890
6 (20.0, 30.0] 188 7.172835
7 (30.0, inf] 112 4.273178
In [119]:
# Mean Absolute Percentage Error for model with selected amenities specifications and amenities
np.absolute(df1['percentage_error_selected']).mean()
Out[119]:
15.945578609580595
In [120]:
 sns.FacetGrid(df1, col="wap_interval").map(plt.hist, "residual_selected")
Out[120]:
<seaborn.axisgrid.FacetGrid at 0x1a5b24a5780>

It's good that clusters of micromarkets are able to explain price variations.

In [121]:
sns.relplot(x="wap", y="percentage_error_selected", hue="zone", data=df1)
sns.relplot(x="wap", y="percentage_error_selected", hue="wap_level", data=df1)
sns.relplot(x="wap", y="percentage_error_selected", hue="residual_abs_level", data=df1)
Out[121]:
<seaborn.axisgrid.FacetGrid at 0x1a5b2b04940>

Large errors in prediction occur in rare instances. Same goes for low errors at high prices.

In [122]:
f, ax = plt.subplots(figsize=(10,5))
sns.boxplot(x="mm_cluster", y="percentage_error_selected", data=df1)
Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5a41d56d8>

7.2 Boxplots of Base Model Residuals & Wap Levels Against Amenities

Let's observe how wap and absolute residual values of base model varies across different categories for different amenities and specifications. 0 means no amenities are present in an amenity cluster while 1 implies at least one amenity is present in that cluster.

In [123]:
g = sns.catplot(x="am_cyber", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="am_cyber", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
In [124]:
g = sns.catplot(x="am_education", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="am_education", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
In [125]:
g = sns.catplot(x="am_finance", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="am_finance", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
In [126]:
g = sns.catplot(x="am_health_related", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="am_health_related", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
In [127]:
g = sns.catplot(x="am_security", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="am_security", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
In [128]:
g = sns.catplot(x="am_indoor_game", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="am_indoor_game", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')

Weighted average price and price residuals vary the most for amenities falling under high-level category of residuals.

Amenities with low p-values as shown during chi-square test are showing significant variations in residuals and weighted average prices across different categories.

Weighted average prices are influenced by certain amenities more than others. The presence and absence of specifications also makes a difference to prices.

7.3 Boxplots of Base Model Residuals & Wap Levels Against Specifications

In [129]:
g = sns.catplot(x="sp_interior", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="sp_interior", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
In [130]:
g = sns.catplot(x="sp_main", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="sp_main", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
In [131]:
g = sns.catplot(x="sp_exterior", y="wap",col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="sp_exterior", y="residual",col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
In [132]:
g = sns.catplot(x="sp_points", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="sp_points", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
In [133]:
g = sns.catplot(x="sp_cement", y="wap", col="wap_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')
g = sns.catplot(x="sp_cement", y="residual", col="residual_abs_level", data=df1, kind="box", height=5, aspect=.7,palette='Set3')

Conclusion: Same as that for amenities. Certain specifications impact project prices more than others as seen from chi-square tests. The presence and absence of specifications also makes a difference to prices.

7.4 Seperate Models for High & Low Cutoff Prices

How do model performances vary for high and low prices?

In [134]:
low_cutoff = df1.loc[df1['wap'] <= 10000, :] # Low-Cutoff prices: wap<=10000
high_cutoff = df1.loc[df1['wap'] > 10000, :] # High-Cutoff prices: wap>10000
In [135]:
hc = smf.ols(formula='log_wap ~ max_size + C(mm_cluster) + bp1 + bp2 + bp3 + C(launch_year) + C(proj_bedrooms_clean) + C(unit_type) + C(am_cyber) + C(am_education) + C(am_finance) + C(am_security) + C(am_health_related) + C(am_indoor_game) + C(sp_interior) + C(sp_main) + C(sp_exterior) + C(sp_points) + C(sp_cement)', data=high_cutoff).fit()
print(hc.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                log_wap   R-squared:                       0.756
Model:                            OLS   Adj. R-squared:                  0.746
Method:                 Least Squares   F-statistic:                     75.20
Date:                Mon, 11 Mar 2019   Prob (F-statistic):          3.97e-273
Time:                        02:03:49   Log-Likelihood:                 251.00
No. Observations:                1039   AIC:                            -418.0
Df Residuals:                     997   BIC:                            -210.3
Df Model:                          41                                         
Covariance Type:            nonrobust                                         
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
Intercept                                9.5370      0.074    129.190      0.000       9.392       9.682
C(mm_cluster)[T.1]                       0.8746      0.042     21.067      0.000       0.793       0.956
C(mm_cluster)[T.2]                       0.0030      0.197      0.015      0.988      -0.383       0.389
C(mm_cluster)[T.3]                       0.3661      0.026     13.845      0.000       0.314       0.418
C(mm_cluster)[T.4]                       0.9700      0.082     11.807      0.000       0.809       1.131
C(mm_cluster)[T.5]                       0.7714      0.036     21.685      0.000       0.702       0.841
C(mm_cluster)[T.6]                       0.1917      0.024      8.045      0.000       0.145       0.238
C(mm_cluster)[T.7]                       1.1725      0.058     20.298      0.000       1.059       1.286
C(mm_cluster)[T.8]                       0.9855      0.037     26.395      0.000       0.912       1.059
C(mm_cluster)[T.10]                     -0.0621      0.076     -0.816      0.415      -0.212       0.087
C(mm_cluster)[T.11]                      0.6751      0.040     16.815      0.000       0.596       0.754
C(mm_cluster)[T.12]                     -0.1120      0.072     -1.551      0.121      -0.254       0.030
C(mm_cluster)[T.13]                      0.4830      0.056      8.614      0.000       0.373       0.593
C(mm_cluster)[T.14]                      0.1328      0.027      4.849      0.000       0.079       0.187
C(mm_cluster)[T.15]                      0.9529      0.045     20.992      0.000       0.864       1.042
C(mm_cluster)[T.16]                      0.2972      0.025     12.089      0.000       0.249       0.345
C(mm_cluster)[T.17]                      0.4717      0.030     15.636      0.000       0.413       0.531
C(mm_cluster)[T.18]                      0.8184      0.077     10.626      0.000       0.667       0.970
C(mm_cluster)[T.19]                      0.0613      0.084      0.726      0.468      -0.104       0.227
C(launch_year)[T.2015]                   0.0449      0.017      2.619      0.009       0.011       0.078
C(launch_year)[T.2016]                   0.0805      0.018      4.551      0.000       0.046       0.115
C(launch_year)[T.2017]                   0.0792      0.019      4.097      0.000       0.041       0.117
C(launch_year)[T.2018]                   0.1077      0.035      3.118      0.002       0.040       0.176
C(proj_bedrooms_clean)[T.2 and more]     0.0002      0.019      0.013      0.990      -0.037       0.038
C(proj_bedrooms_clean)[T.3 and more]     0.0481      0.033      1.455      0.146      -0.017       0.113
C(proj_bedrooms_clean)[T.4 and more]     0.0743      0.070      1.055      0.292      -0.064       0.213
C(proj_bedrooms_clean)[T.6 and more]    -0.0604      0.116     -0.521      0.603      -0.288       0.167
C(unit_type)[T.Villa]                 1.614e-16   5.34e-17      3.024      0.003    5.67e-17    2.66e-16
C(am_cyber)[T.1]                        -0.1624      0.029     -5.544      0.000      -0.220      -0.105
C(am_education)[T.1]                     0.0453      0.016      2.846      0.005       0.014       0.076
C(am_finance)[T.1]                      -0.0370      0.028     -1.301      0.194      -0.093       0.019
C(am_security)[T.1]                      0.0088      0.014      0.617      0.537      -0.019       0.037
C(am_health_related)[T.1]               -0.0046      0.028     -0.162      0.871      -0.060       0.051
C(am_indoor_game)[T.1]                  -0.0284      0.016     -1.804      0.072      -0.059       0.002
C(sp_interior)[T.1]                     -0.0087      0.021     -0.426      0.671      -0.049       0.032
C(sp_main)[T.1]                         -0.0148      0.019     -0.768      0.443      -0.052       0.023
C(sp_exterior)[T.1]                      0.0220      0.021      1.039      0.299      -0.020       0.063
C(sp_points)[T.1]                       -0.0202      0.024     -0.856      0.392      -0.067       0.026
C(sp_cement)[T.1]                        0.1679      0.085      1.968      0.049       0.000       0.335
max_size                              1.527e-05    1.2e-05      1.269      0.205   -8.34e-06    3.89e-05
bp1                                     -0.2676      0.068     -3.914      0.000      -0.402      -0.133
bp2                                     -0.2281      0.064     -3.589      0.000      -0.353      -0.103
bp3                                     -0.1089      0.058     -1.862      0.063      -0.224       0.006
==============================================================================
Omnibus:                       46.214   Durbin-Watson:                   1.820
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              135.557
Skew:                          -0.099   Prob(JB):                     3.67e-30
Kurtosis:                       4.759   Cond. No.                     1.19e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.48e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [136]:
residual_log (hc, high_cutoff, 'residual_log_hc', 'residual_hc')
median:   -0.004438692770129649
C:\Users\User\Anaconda3\envs\anirban\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
C:\Users\User\Anaconda3\envs\anirban\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Out[136]:
(count    1.039000e+03
 mean     4.221062e-13
 std      1.901325e-01
 min     -1.074862e+00
 25%     -1.162498e-01
 50%     -4.438693e-03
 75%      1.136970e-01
 max      6.650892e-01
 Name: residual_log_hc, dtype: float64, None)
In [137]:
percentage_error_projects (high_cutoff, 'residual_hc','percentage_error_hc', 'percentage_error_interval_hc')
C:\Users\User\Anaconda3\envs\anirban\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
C:\Users\User\Anaconda3\envs\anirban\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
Out[137]:
percentage_error_interval_hc project_number percentage_projects
0 (-inf, -30.0] 77 7.410972
1 (-30.0, -20.0] 78 7.507218
2 (-20.0, -10.0] 150 14.436959
3 (-10.0, 0.0] 228 21.944177
4 (0.0, 10.0] 225 21.655438
5 (10.0, 20.0] 167 16.073147
6 (20.0, 30.0] 77 7.410972
7 (30.0, inf] 37 3.561116
In [138]:
# Mean Absolute Percentage Error for high cut off price model residuals
np.absolute(high_cutoff['percentage_error_hc']).mean()
Out[138]:
14.724125928637218
In [139]:
lc = smf.ols(formula='log_wap ~ max_size + C(mm_cluster) + C(proj_bedrooms_clean) + bp1 + bp2 + bp3 + C(launch_year)+ C(unit_type) + C(am_cyber) + C(am_education) + C(am_finance) + C(am_security) + C(am_health_related) + C(am_indoor_game) + C(sp_interior) + C(sp_main) + C(sp_exterior) + C(sp_points) + C(sp_cement)', data=low_cutoff).fit()
print(lc.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                log_wap   R-squared:                       0.759
Model:                            OLS   Adj. R-squared:                  0.754
Method:                 Least Squares   F-statistic:                     157.2
Date:                Mon, 11 Mar 2019   Prob (F-statistic):               0.00
Time:                        02:03:50   Log-Likelihood:                 379.87
No. Observations:                1582   AIC:                            -695.7
Df Residuals:                    1550   BIC:                            -524.0
Df Model:                          31                                         
Covariance Type:            nonrobust                                         
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
Intercept                                8.9639      0.172     51.983      0.000       8.626       9.302
C(mm_cluster)[T.2]                      -0.5548      0.021    -26.943      0.000      -0.595      -0.514
C(mm_cluster)[T.3]                      -0.0026      0.088     -0.029      0.977      -0.176       0.171
C(mm_cluster)[T.6]                       0.2129      0.067      3.180      0.002       0.082       0.344
C(mm_cluster)[T.9]                      -0.8872      0.020    -43.672      0.000      -0.927      -0.847
C(mm_cluster)[T.10]                     -0.0591      0.058     -1.012      0.312      -0.174       0.055
C(mm_cluster)[T.12]                     -0.1303      0.025     -5.206      0.000      -0.179      -0.081
C(mm_cluster)[T.14]                      0.1054      0.050      2.108      0.035       0.007       0.203
C(mm_cluster)[T.16]                      0.0937      0.113      0.829      0.407      -0.128       0.315
C(mm_cluster)[T.19]                     -0.3883      0.021    -18.820      0.000      -0.429      -0.348
C(proj_bedrooms_clean)[T.2 and more]     0.0004      0.024      0.019      0.985      -0.046       0.047
C(proj_bedrooms_clean)[T.3 and more]    -0.1041      0.072     -1.450      0.147      -0.245       0.037
C(proj_bedrooms_clean)[T.4 and more]     0.1244      0.251      0.496      0.620      -0.367       0.616
C(proj_bedrooms_clean)[T.6 and more] -4.535e-15   2.75e-14     -0.165      0.869   -5.84e-14    4.93e-14
C(launch_year)[T.2015]                   0.0216      0.014      1.582      0.114      -0.005       0.048
C(launch_year)[T.2016]                   0.0360      0.014      2.542      0.011       0.008       0.064
C(launch_year)[T.2017]                   0.0759      0.015      4.914      0.000       0.046       0.106
C(launch_year)[T.2018]                   0.0813      0.025      3.264      0.001       0.032       0.130
C(unit_type)[T.Villa]                   -0.0453      0.054     -0.839      0.402      -0.151       0.061
C(am_cyber)[T.1]                         0.0622      0.038      1.629      0.103      -0.013       0.137
C(am_education)[T.1]                     0.0270      0.013      2.034      0.042       0.001       0.053
C(am_finance)[T.1]                       0.0188      0.030      0.636      0.525      -0.039       0.077
C(am_security)[T.1]                      0.0202      0.012      1.634      0.103      -0.004       0.044
C(am_health_related)[T.1]                0.0104      0.030      0.349      0.727      -0.048       0.069
C(am_indoor_game)[T.1]                  -0.0112      0.013     -0.863      0.389      -0.037       0.014
C(sp_interior)[T.1]                      0.0151      0.016      0.945      0.345      -0.016       0.046
C(sp_main)[T.1]                         -0.0094      0.015     -0.630      0.529      -0.039       0.020
C(sp_exterior)[T.1]                     -0.0115      0.017     -0.690      0.490      -0.044       0.021
C(sp_points)[T.1]                        0.0089      0.015      0.575      0.566      -0.021       0.039
C(sp_cement)[T.1]                     1.027e-16   1.72e-17      5.961      0.000    6.89e-17    1.37e-16
max_size                              3.763e-05   1.71e-05      2.205      0.028    4.16e-06    7.11e-05
bp1                                     -0.1641      0.169     -0.970      0.332      -0.496       0.168
bp2                                     -0.0291      0.166     -0.175      0.861      -0.355       0.297
bp3                                      0.0893      0.168      0.530      0.596      -0.241       0.419
==============================================================================
Omnibus:                      144.094   Durbin-Watson:                   1.622
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              770.948
Skew:                          -0.228   Prob(JB):                    3.90e-168
Kurtosis:                       6.389   Cond. No.                     1.14e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.69e-24. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [140]:
residual_log (lc, low_cutoff, 'residual_log_lc', 'residual_lc') #Residual summary statistics for low cutoff price model
median:   0.002259362190613956
C:\Users\User\Anaconda3\envs\anirban\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
C:\Users\User\Anaconda3\envs\anirban\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Out[140]:
(count    1.582000e+03
 mean     1.546783e-12
 std      1.903784e-01
 min     -1.316004e+00
 25%     -1.036744e-01
 50%      2.259362e-03
 75%      1.088615e-01
 max      8.701490e-01
 Name: residual_log_lc, dtype: float64, None)
In [141]:
percentage_error_projects (low_cutoff, 'residual_lc','percentage_error_lc', 'percentage_error_interval_lc')
C:\Users\User\Anaconda3\envs\anirban\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
C:\Users\User\Anaconda3\envs\anirban\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
Out[141]:
percentage_error_interval_lc project_number percentage_projects
0 (-inf, -30.0] 114 7.206068
1 (-30.0, -20.0] 92 5.815424
2 (-20.0, -10.0] 220 13.906448
3 (-10.0, 0.0] 353 22.313527
4 (0.0, 10.0] 394 24.905183
5 (10.0, 20.0] 258 16.308470
6 (20.0, 30.0] 108 6.826802
7 (30.0, inf] 43 2.718078
In [142]:
# Mean Absolute Percentage Error of low cutoff price model residuals
np.absolute(low_cutoff['percentage_error_lc']).mean()
Out[142]:
14.33385286069046

The model for low cutoff prices is better in predicting weighted average prices. This shows that low priced projects can be better predicted with lesser errors than higher priced projects.

8. Random Forest Regressor

In [143]:
include=['log_wap','max_size','mm_cluster', 'bp1','bp2','bp3', 'unit_type','proj_bedrooms_clean','launch_year', 'am_security', 'am_indoor_game', 'am_cyber', 'am_finance', 'am_health_related','am_education', 'sp_interior', 'sp_main', 'sp_exterior', 'sp_points','sp_cement']
df2=pd.DataFrame(df1[include])

for i in ['unit_type','proj_bedrooms_clean','launch_year', 'am_security', 'am_indoor_game', 'am_cyber', 'am_finance', 'am_health_related', 'am_education', 'sp_interior', 'sp_main', 'sp_exterior', 'sp_points', 'sp_cement']:
    label_encoder(df2, i)
    type_converter (df2, i, 'int')

df2.head()
Out[143]:
log_wap max_size mm_cluster bp1 bp2 bp3 unit_type proj_bedrooms_clean launch_year am_security am_indoor_game am_cyber am_finance am_health_related am_education sp_interior sp_main sp_exterior sp_points sp_cement
0 9.579418 744.75 16 0.954955 0.045045 0.000000 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 9.545526 1003.30 16 0.000000 0.653333 0.346667 0 1 0 1 1 1 0 0 1 0 1 0 0 0
2 9.545526 1003.30 16 0.000000 0.653333 0.346667 0 1 0 1 1 1 0 0 1 0 1 0 0 0
3 9.461644 590.51 16 0.891667 0.108333 0.000000 0 0 2 0 0 0 0 0 0 0 0 0 0 0
4 9.739884 1717.00 16 0.030303 0.666667 0.303030 0 0 1 1 0 0 1 0 1 0 1 0 0 0
In [144]:
from sklearn.model_selection import train_test_split

target = 'log_wap'
features = df2.columns != target
x = df2.loc[:,features]
y = df2[target]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=8)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
(1965, 19) (656, 19) (1965,) (656,)
In [145]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=2000, n_jobs=5, max_depth=12,random_state=4,criterion='mae')
rf.fit(x_train, y_train)
Out[145]:
RandomForestRegressor(bootstrap=True, criterion='mae', max_depth=12,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=2000, n_jobs=5,
           oob_score=False, random_state=4, verbose=0, warm_start=False)
In [146]:
def model_R2 (model,x_testdataframe,y_testdataframe):
    from sklearn.metrics import r2_score
    
    y_pred=model.predict(x_testdataframe)
    return (print("R2 Score -", r2_score(y_testdataframe, y_pred)))
In [147]:
print("Random Forest Regressor:") 
model_R2 (rf, x_test, y_test)
Random Forest Regressor:
R2 Score - 0.8866487863156991
In [148]:
def plot_feature_importances (model, kind, title, color, dataframe):
    importances = pd.Series(data=model.feature_importances_, index= dataframe.columns)
    # Sort importances
    importances_sorted = importances.sort_values()
    # Draw a horizontal barplot of importances_sorted
    importances_sorted.plot(kind=kind, color=color)
    plt.title(title)
    return(plt.show())
In [149]:
plot_feature_importances (rf, 'barh', 'Random Forest Feature Importances', 'maroon', x_train)
In [150]:
def compute_residual (model, column, dataframe,x_test_df,y_test_df):
    y_pred=model.predict(x_test_df)
    y_pred_wap=np.exp(y_pred)
    y_actual_wap=np.exp(y_test_df)
    y_resid = y_actual_wap - y_pred_wap
    dataframe[column]=pd.DataFrame(y_resid)
    return(dataframe[column].describe())
In [151]:
compute_residual (rf, 'residual_rf', df1, x_test, y_test)
Out[151]:
count      656.000000
mean       375.046851
std       3792.186828
min     -19278.754663
25%       -836.391724
50%        -18.459803
75%       1027.564563
max      26924.488906
Name: residual_rf, dtype: float64
In [152]:
percentage_error_projects (df1, 'residual_rf', 'percentage_error_rf', 'percentage_error_interval_rf')
Out[152]:
percentage_error_interval_rf project_number percentage_projects
0 (-inf, -30.0] 73 11.128049
1 (-30.0, -20.0] 47 7.164634
2 (-20.0, -10.0] 88 13.414634
3 (-10.0, 0.0] 126 19.207317
4 (0.0, 10.0] 131 19.969512
5 (10.0, 20.0] 92 14.024390
6 (20.0, 30.0] 54 8.231707
7 (30.0, inf] 45 6.859756
In [153]:
# Mean Absolute Percentage Error for Random Forest Regressor
print('MAPE of Random Forest Regressor:', np.absolute(df1['percentage_error_rf']).mean())
MAPE of Random Forest Regressor: 19.016171469747338
In [154]:
sns.FacetGrid(df1, col="wap_interval",margin_titles=True).map(plt.hist, "residual_rf")
Out[154]:
<seaborn.axisgrid.FacetGrid at 0x1a5b26b5278>
In [155]:
sns.relplot(x="wap", y="percentage_error_rf", hue="zone", data=df1)
sns.relplot(x="wap", y="percentage_error_rf", hue="residual_abs_level", data=df1)
sns.relplot(x="wap", y="percentage_error_rf", hue="wap_level", data=df1)
Out[155]:
<seaborn.axisgrid.FacetGrid at 0x1a5b0d962e8>

Errors are concentrated at smaller levels regardless of the categories they belong to. Due to relatively higher level of residuals, scatter plots each category are widely spread out compared to linear regression plot.

In [156]:
f, ax = plt.subplots(figsize=(10,5))
sns.boxplot(x="mm_cluster", y="percentage_error_rf", data=df1)
Out[156]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5b1c4d2e8>

The Random Forest regressor model is performing even more poorly, indicating greater degree of residuals. Percentage residual is starting to show greater variance across clusters.

9. Extreme Gradient Boost Regressor

In [157]:
from xgboost import XGBRegressor

xg_reg=XGBRegressor(objective='reg:linear', n_estimators=170, learning_rate=0.1, seed=100,base_score=0.1,reg_alpha=0.4,reg_lambda=0.3)
xg_reg.fit(x_train, y_train)
Out[157]:
XGBRegressor(base_score=0.1, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=170,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0.4, reg_lambda=0.3, scale_pos_weight=1, seed=100,
       silent=True, subsample=1)
In [158]:
from sklearn.model_selection import cross_validate

# cross validate
scores_xgreg = cross_validate(xg_reg, x_train, y_train, cv=8, return_train_score=True,return_estimator=False)
scores_xgreg =pd.DataFrame(scores_xgreg )
scores_xgreg 
Out[158]:
fit_time score_time test_score train_score
0 0.155107 0.0 0.896292 0.934562
1 0.134581 0.0 0.898020 0.934595
2 0.143283 0.0 0.908593 0.932432
3 0.141700 0.0 0.894524 0.934257
4 0.141576 0.0 0.913102 0.933907
5 0.141642 0.0 0.889054 0.935658
6 0.141720 0.0 0.896041 0.934170
7 0.141245 0.0 0.902474 0.934133
In [159]:
print('Mean Test Score for XG Boost Regressor: ', scores_xgreg ['test_score'].mean())
print('Mean Train Score for XG Boost Regressor: ', scores_xgreg ['train_score'].mean())
Mean Test Score for XG Boost Regressor:  0.8997623279299931
Mean Train Score for XG Boost Regressor:  0.9342141411349951
In [160]:
plot_feature_importances (xg_reg, 'barh', 'XG Boost Feature Importances', 'purple', x_train)

Just like random forest, micromarket cluster remains the favourite feature while max_size and bedroom by typologies continue to top the list as well. Feature map tells us different importance of predictor variables being considered.

In [161]:
print("XGBoost Regressor:") 
model_R2 (xg_reg, x_test, y_test)
XGBoost Regressor:
R2 Score - 0.90169804324625
In [162]:
compute_residual (xg_reg, 'residual_xgreg', df1, x_test, y_test)
Out[162]:
count      656.000000
mean       352.151611
std       3530.713007
min     -22406.589844
25%       -915.070679
50%        -41.671143
75%        828.120115
max      22366.789063
Name: residual_xgreg, dtype: float64
In [163]:
percentage_error_projects (df1, 'residual_xgreg', 'percentage_error_xgreg', 'percentage_error_interval_xgreg')
Out[163]:
percentage_error_interval_xgreg project_number percentage_projects
0 (-inf, -30.0] 71 10.823171
1 (-30.0, -20.0] 52 7.926829
2 (-20.0, -10.0] 96 14.634146
3 (-10.0, 0.0] 121 18.445122
4 (0.0, 10.0] 130 19.817073
5 (10.0, 20.0] 102 15.548780
6 (20.0, 30.0] 39 5.945122
7 (30.0, inf] 45 6.859756
In [164]:
# Mean Absolute Percentage Error for Extreme Gradient Boost Regressor
print('MAPE of Gradient Boost Regressor:', np.absolute(df1['percentage_error_xgreg']).mean())
MAPE of Gradient Boost Regressor: 18.33550318305244
In [165]:
sns.FacetGrid(df1, col="wap_interval",margin_titles=True).map(plt.hist, "residual_xgreg")
Out[165]:
<seaborn.axisgrid.FacetGrid at 0x1a5b4d418d0>
In [166]:
sns.relplot(x="wap", y="percentage_error_xgreg", hue="zone", data=df1)
sns.relplot(x="wap", y="percentage_error_xgreg", hue="wap_level", data=df1)
sns.relplot(x="wap", y="percentage_error_xgreg", hue="residual_abs_level", data=df1)
Out[166]:
<seaborn.axisgrid.FacetGrid at 0x1a5a4767b70>

Compared to Random Forest, errors are lower overall for XGBoost Regressor. However, it is unable to beat multiple linear regression model.

In [167]:
f, ax = plt.subplots(figsize=(10,5))
sns.boxplot(x="mm_cluster", y="percentage_error_xgreg", data=df1)
Out[167]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5b195e0f0>

10. Extreme Gradient Boost Regressor Using All Features

In [168]:
include=['log_wap','max_size','min_size','mm_cluster','bp1','bp2','bp3','bp4','bp5','bp6','bp7','bp8','b1','b2','b3','b4','b5','b6','b7','b8','unit_type','construction_status','proj_bedrooms_clean','launch_year']
amen=['am_water','am_security','am_indoor_game','am_outdoor_game','am_sewage','am_roads','am_cyber','am_power','am_religion','am_space','am_finance','am_firefighting','am_health_related','am_car_parking','am_education','am_maintenance','am_social_gatherings','am_lift','am_shop']
df3=pd.concat([df1[include],df1[amen],specifications],axis=1)

for i in list(df3.columns)[20:]:
    label_encoder(df3, i)
    type_converter (df3, i, 'int')
    
df3.head()
Out[168]:
log_wap max_size min_size mm_cluster bp1 bp2 bp3 bp4 bp5 bp6 ... sp_main sp_interior sp_switches sp_windows sp_wiring sp_exterior sp_frame sp_points sp_lobby sp_cement
0 9.579418 744.75 312.48 16 0.954955 0.045045 0.000000 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0
1 9.545526 1003.30 730.87 16 0.000000 0.653333 0.346667 0.0 0.0 0.0 ... 1 0 0 1 0 0 0 0 0 0
2 9.545526 1003.30 730.87 16 0.000000 0.653333 0.346667 0.0 0.0 0.0 ... 1 0 0 1 0 0 0 0 0 0
3 9.461644 590.51 269.96 16 0.891667 0.108333 0.000000 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0
4 9.739884 1717.00 681.57 16 0.030303 0.666667 0.303030 0.0 0.0 0.0 ... 1 0 0 1 0 0 0 0 0 0

5 rows × 60 columns

In [169]:
dependent = 'log_wap'
independent = df3.columns != dependent
x1 = df3.loc[:,independent]
y1 = df3[dependent]
x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, random_state=8) # Split into train and test dataset

xg_reg_all=XGBRegressor(objective='reg:linear', n_estimators=200, learning_rate=0.095, seed=100,base_score=0,reg_alpha=0.3)
xg_reg_all.fit(x1_train, y1_train) # Fit Extreme Gradient Boost Regressor into  training data
Out[169]:
XGBRegressor(base_score=0, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.095, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=200,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0.3, reg_lambda=1, scale_pos_weight=1, seed=100,
       silent=True, subsample=1)
In [170]:
# cross validate
scores_xgreg_all = cross_validate(xg_reg_all, x1_train, y1_train, cv=8, return_train_score=True,return_estimator=False)
scores_xgreg_all = pd.DataFrame(scores_xgreg_all)
scores_xgreg_all 
Out[170]:
fit_time score_time test_score train_score
0 0.379293 0.000000 0.898788 0.943817
1 0.367638 0.000000 0.908572 0.941766
2 0.374624 0.000000 0.900874 0.943778
3 0.374612 0.000000 0.890956 0.941515
4 0.367580 0.000000 0.912512 0.942819
5 0.373473 0.000000 0.895845 0.942618
6 0.369247 0.003989 0.898088 0.943955
7 0.365841 0.000000 0.909452 0.942670
In [171]:
print('Mean Test Score for XG Boost Regressor using all features: ', scores_xgreg_all ['test_score'].mean())
print('Mean Train Score for XG Boost Regressor using all features: ', scores_xgreg_all ['train_score'].mean())
Mean Test Score for XG Boost Regressor using all features:  0.9018859017773793
Mean Train Score for XG Boost Regressor using all features:  0.9428671489651761
In [172]:
print("XGBoost Regressor for all features:") 
model_R2 (xg_reg_all, x1_test, y1_test)
XGBoost Regressor for all features:
R2 Score - 0.8936331785908977
In [173]:
fig,ax=plt.subplots(figsize=(10,12))
fig=plot_feature_importances (xg_reg_all, 'barh', 'XG Boost Feature Importances using all features', 'purple', x1_train)

Not all features are important to the XG Boost Regressor model. Most of the features in amenities and specifications were not even considered.

In [174]:
compute_residual (xg_reg_all, 'residual_xgreg_all', df1, x1_test, y1_test)
Out[174]:
count      656.000000
mean       308.060737
std       3708.009407
min     -24822.617188
25%       -834.762158
50%        -16.553281
75%        928.250150
max      23051.935547
Name: residual_xgreg_all, dtype: float64
In [175]:
# Error Distribution for Extreme Gradient Boost Regressor using all features
percentage_error_projects (df1, 'residual_xgreg_all', 'percentage_error_xgreg_all', 'percentage_error_interval_xgreg_all')
Out[175]:
percentage_error_interval_xgreg_all project_number percentage_projects
0 (-inf, -30.0] 76 11.585366
1 (-30.0, -20.0] 52 7.926829
2 (-20.0, -10.0] 82 12.500000
3 (-10.0, 0.0] 121 18.445122
4 (0.0, 10.0] 141 21.493902
5 (10.0, 20.0] 93 14.176829
6 (20.0, 30.0] 56 8.536585
7 (30.0, inf] 35 5.335366
In [176]:
# Mean Absolute Percentage Error for Extreme Gradient Boost Regressor using all features
print('MAPE of XGBoost Regressor using all features:', np.absolute(df1['percentage_error_xgreg_all']).mean())
MAPE of XGBoost Regressor using all features: 18.861447572632194

Errors are slightly higher when all the models are fed to the XG Boost Regressor model.

In [177]:
sns.FacetGrid(df1, col="wap_interval",margin_titles=True).map(plt.hist, "residual_xgreg_all")
Out[177]:
<seaborn.axisgrid.FacetGrid at 0x1a5b38c0be0>
In [178]:
sns.relplot(x="wap", y="percentage_error_xgreg_all", hue="zone", data=df1)
sns.relplot(x="wap", y="percentage_error_xgreg_all", hue="wap_level", data=df1)
sns.relplot(x="wap", y="percentage_error_xgreg_all", hue="residual_abs_level", data=df1)
Out[178]:
<seaborn.axisgrid.FacetGrid at 0x1a5b1c88b38>

Errors slightly larger than XG Boost model with selected amenities and specifications. Thus, it is important to be selective about the features we feed that can be good predictors.

In [179]:
f, ax = plt.subplots(figsize=(10,5))
sns.boxplot(x="mm_cluster", y="percentage_error_xgreg_all", data=df1)
Out[179]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a5b4c52cf8>